1- Introduction¶
This dataset captures the purchasing behavior of roughly 9,500 online clients of Vanguard, a major American retailer, over the past 12 months. It belongs to the retail and e-commerce industry. Overall, it provides a comprehensive view of online shopping behavior and customer demographics, which is essential for analyzing purchasing trends and informing strategic decisions about a retailer's online presence.
Here is the description of the dataset and the variables:
Important Assumption¶
The dataset includes demographic details, customer segmentation labels, and purchasing patterns. However, there is an issue: the documentation needed to interpret the customer segmentation variables and their respective classes is inadequate; only the variables describing purchasing behavior and demographics are clearly defined. Even so, a customer segmentation variable is the most appropriate choice of target for prediction, since segments are the effect of the causal variables (purchasing patterns and demographics). Therefore, to meet the project requirements within the available time, a segmentation model is developed with 'SEGMENT_1' as the outcome variable; the meaning of its classes will be clarified later in stakeholder meetings.
2- Exploratory Data Analysis¶
Loading the dataset and the packages necessary for performing EDA.
# Install required libraries
! pip install numpy pandas matplotlib seaborn scikit-learn openpyxl
# Libraries for data manipulation
import numpy as np
import pandas as pd
# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# Improve the aesthetics of the visualizations
sns.set()
# Configuration settings for display options
pd.set_option("display.max_columns", None) # No limit on the number of displayed columns
pd.set_option("display.max_rows", 200) # Display up to 200 rows
# Suppress warnings for cleaner output (consider being more selective with warnings to ignore)
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Load the dataset from a specified file path
df = pd.read_excel(r"C:\Users\palad\Downloads\ONLINE_CLIENTS_SV.xlsx", sheet_name='DB')
# Show the first few rows of the dataset to verify it's loaded correctly
df.head()
| CLIENT_ID | CUMMSALES_LAST12WEEKS | FREQUENCY_LAST12WEEKS | AVERAGE_TICKET | RECENCY | CONSISTENCY | BRANCH | SEGMENT_1 | LOYALTY_GROUP | PRICE_GROUP | SEGMENT_2 | GENDER | MARITAL_STATUS | BIRTH_DATE | AGE | MOSTUSED_PLATFORM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22335 | 7516.357 | 10 | 751.6357 | 11 | 8 | 2979 | Core | Vip | Very Price Sensitive | B | Female | Married | 1973-11-12 | 44.569863 | Mobile |
| 1 | 22349 | 860.535 | 1 | 860.5350 | 49 | 1 | 2979 | Core | Ocasional | Selective Price Sensitive | B | Female | Married | 1988-04-24 | 30.112329 | Web |
| 2 | 22389 | 1576.317 | 2 | 788.1585 | 74 | 1 | 2979 | Core | Ocasional | Very Price Sensitive | B | Female | Married | 1977-01-15 | 41.391781 | Mobile |
| 3 | 22679 | 4531.182 | 3 | 1510.3940 | 24 | 2 | 2961 | Core | Ocasional | Moderately Price Sensitive | B | Male | Married | 1987-05-20 | 31.043836 | Mobile |
| 4 | 22878 | 6193.583 | 1 | 6193.5830 | 70 | 1 | 2979 | Core | Ocasional | Selective Price Sensitive | B | Male | Married | 1968-09-07 | 49.753425 | Web |
# checking shape of the data
print("There are", df.shape[0], 'rows and', df.shape[1], "columns.")
There are 9504 rows and 16 columns.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9504 entries, 0 to 9503 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENT_ID 9504 non-null int64 1 CUMMSALES_LAST12WEEKS 9504 non-null float64 2 FREQUENCY_LAST12WEEKS 9504 non-null int64 3 AVERAGE_TICKET 9504 non-null float64 4 RECENCY 9504 non-null int64 5 CONSISTENCY 9504 non-null int64 6 BRANCH 9504 non-null int64 7 SEGMENT_1 9504 non-null object 8 LOYALTY_GROUP 9504 non-null object 9 PRICE_GROUP 9504 non-null object 10 SEGMENT_2 9504 non-null object 11 GENDER 9503 non-null object 12 MARITAL_STATUS 9503 non-null object 13 BIRTH_DATE 8346 non-null datetime64[ns] 14 AGE 8346 non-null float64 15 MOSTUSED_PLATFORM 9504 non-null object dtypes: datetime64[ns](1), float64(3), int64(5), object(7) memory usage: 1.2+ MB
Dropping Redundant Variables¶
Columns such as CLIENT_ID and BIRTH_DATE are redundant for this analysis: CLIENT_ID is just a unique identifier, and the presence of AGE makes BIRTH_DATE unnecessary, especially since the project is not focused on time series or trends over time. Both are dropped. BRANCH, which identifies the branch where a transaction happened, should be treated as a categorical (factor) type rather than an integer. Finally, CUMMSALES_LAST12WEEKS and FREQUENCY_LAST12WEEKS are renamed to CUMSALES and FREQUENCY for brevity.
# To retain the original DataFrame, a copy is made.
data = df.copy()
# Drop redundant columns and create a new DataFrame 'data' with the remaining columns
data = data.drop(['CLIENT_ID', 'BIRTH_DATE'], axis=1)
#Change Branch into appropriate datatype
data['BRANCH'] = data['BRANCH'].astype('object')
# Rename columns for brevity
data.rename(columns={'CUMMSALES_LAST12WEEKS': 'CUMSALES', 'FREQUENCY_LAST12WEEKS': 'FREQUENCY'}, inplace=True)
data.head()
| CUMSALES | FREQUENCY | AVERAGE_TICKET | RECENCY | CONSISTENCY | BRANCH | SEGMENT_1 | LOYALTY_GROUP | PRICE_GROUP | SEGMENT_2 | GENDER | MARITAL_STATUS | AGE | MOSTUSED_PLATFORM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7516.357 | 10 | 751.6357 | 11 | 8 | 2979 | Core | Vip | Very Price Sensitive | B | Female | Married | 44.569863 | Mobile |
| 1 | 860.535 | 1 | 860.5350 | 49 | 1 | 2979 | Core | Ocasional | Selective Price Sensitive | B | Female | Married | 30.112329 | Web |
| 2 | 1576.317 | 2 | 788.1585 | 74 | 1 | 2979 | Core | Ocasional | Very Price Sensitive | B | Female | Married | 41.391781 | Mobile |
| 3 | 4531.182 | 3 | 1510.3940 | 24 | 2 | 2961 | Core | Ocasional | Moderately Price Sensitive | B | Male | Married | 31.043836 | Mobile |
| 4 | 6193.583 | 1 | 6193.5830 | 70 | 1 | 2979 | Core | Ocasional | Selective Price Sensitive | B | Male | Married | 49.753425 | Web |
Null Values Detected¶
data.isnull().sum()
CUMSALES 0 FREQUENCY 0 AVERAGE_TICKET 0 RECENCY 0 CONSISTENCY 0 BRANCH 0 SEGMENT_1 0 LOYALTY_GROUP 0 PRICE_GROUP 0 SEGMENT_2 0 GENDER 1 MARITAL_STATUS 1 AGE 1158 MOSTUSED_PLATFORM 0 dtype: int64
GENDER and MARITAL_STATUS each have just one missing value. With more than 9,000 observations, dropping those two rows makes no meaningful difference. AGE, however, is missing in 1,158 rows, more than 10% of the total, so dropping is not a good option: valuable data that could improve predictions would be lost. We will decide which specific imputation to apply to AGE as the EDA progresses.
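The drop-versus-impute decision above can be made explicit by computing the per-column missing percentage. The sketch below uses a tiny made-up frame (not the real dataset) and an illustrative rule of thumb: drop rows when the missing share is tiny, consider imputation when it is large.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame mimicking the real pattern: GENDER missing once,
# AGE missing in a sizeable share of rows (values are made up)
n = 200
demo = pd.DataFrame({
    "GENDER": ["Female"] * n,
    "AGE": [35.0] * n,
})
demo.loc[0, "GENDER"] = np.nan
demo.loc[:29, "AGE"] = np.nan  # 30 of 200 rows -> 15%

# Percentage of missing values per column
missing_pct = demo.isnull().mean() * 100

# Rule of thumb: drop rows for tiny shares, impute large ones
drop_row_cols = missing_pct[(missing_pct > 0) & (missing_pct < 1)].index.tolist()
impute_cols = missing_pct[missing_pct >= 10].index.tolist()
print(drop_row_cols, impute_cols)  # ['GENDER'] ['AGE']
```

The 1% and 10% thresholds are assumptions chosen to mirror the reasoning above, not fixed rules.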
Anomalies/Outliers Detected¶
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CUMSALES | 9504.0 | 3749.918529 | 7057.653117 | 2.500000 | 694.732750 | 1713.387000 | 4473.282250 | 279970.140000 |
| FREQUENCY | 9504.0 | 3.216961 | 4.106171 | 1.000000 | 1.000000 | 2.000000 | 4.000000 | 135.000000 |
| AVERAGE_TICKET | 9504.0 | 1248.730602 | 2406.863554 | 2.500000 | 482.148824 | 857.233875 | 1383.428333 | 130698.600000 |
| RECENCY | 9504.0 | 28.771044 | 24.240985 | 0.000000 | 7.000000 | 21.000000 | 48.000000 | 83.000000 |
| CONSISTENCY | 9504.0 | 2.793876 | 2.611770 | 1.000000 | 1.000000 | 2.000000 | 4.000000 | 12.000000 |
| AGE | 8346.0 | 35.985055 | 10.036625 | 2.350685 | 29.595890 | 34.783562 | 40.599315 | 97.810959 |
The summary statistics indicate anomalies in the continuous variables above (for example, a minimum AGE of about 2.35 years, and extreme maxima in CUMSALES and AVERAGE_TICKET). EDA can point these out easily.
#summary for categorical variables
data.describe(include= 'object').T
| count | unique | top | freq | |
|---|---|---|---|---|
| BRANCH | 9504 | 25 | 2978 | 928 |
| SEGMENT_1 | 9504 | 2 | Up | 5643 |
| LOYALTY_GROUP | 9504 | 4 | Ocasional | 6910 |
| PRICE_GROUP | 9504 | 5 | Very Price Insensitive | 2515 |
| SEGMENT_2 | 9504 | 6 | A | 2589 |
| GENDER | 9503 | 2 | Female | 7445 |
| MARITAL_STATUS | 9503 | 3 | Married | 5300 |
| MOSTUSED_PLATFORM | 9504 | 3 | Web | 4929 |
The BRANCH column should also be dropped, as it is unlikely to add value to the model. The target variable SEGMENT_1 can be predicted from the key variables describing customer behavior and purchase patterns. Including BRANCH, with its 25 classes that cannot be merged due to inadequate documentation, would introduce unnecessary complexity without contributing much to the prediction, potentially leading to inefficiency or overfitting. We will also check after EDA whether the variables with 5 and 6 classes can usefully be reduced.
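The high-cardinality argument can be checked programmatically. This is a minimal sketch on made-up data with an arbitrary threshold of 10 levels (an assumption, not a standard value), flagging BRANCH-like columns as candidates for dropping or grouping:

```python
import pandas as pd

# Made-up frame: a BRANCH-like high-cardinality column vs a binary one
demo = pd.DataFrame({
    "BRANCH": [f"B{i % 25}" for i in range(100)],   # 25 distinct levels
    "GENDER": ["Female" if i % 2 else "Male" for i in range(100)],
})

# Flag categorical columns whose number of levels exceeds a threshold
MAX_LEVELS = 10
high_cardinality = [col for col in demo.select_dtypes("object").columns
                    if demo[col].nunique() > MAX_LEVELS]
print(high_cardinality)  # ['BRANCH']
```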
#Drop Branch column
data = data.drop(['BRANCH'], axis=1)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9504 entries, 0 to 9503 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUMSALES 9504 non-null float64 1 FREQUENCY 9504 non-null int64 2 AVERAGE_TICKET 9504 non-null float64 3 RECENCY 9504 non-null int64 4 CONSISTENCY 9504 non-null int64 5 SEGMENT_1 9504 non-null object 6 LOYALTY_GROUP 9504 non-null object 7 PRICE_GROUP 9504 non-null object 8 SEGMENT_2 9504 non-null object 9 GENDER 9503 non-null object 10 MARITAL_STATUS 9503 non-null object 11 AGE 8346 non-null float64 12 MOSTUSED_PLATFORM 9504 non-null object dtypes: float64(3), int64(3), object(7) memory usage: 965.4+ KB
Univariate Visuals¶
def plot_histogram_boxplot(data, feature):
    """
    Plots a histogram and a boxplot for the specified feature in the data.
    Adds mean and median lines to the histogram.
    Args:
    - data: DataFrame containing the data.
    - feature: String representing the column to plot.
    """
    fig, (ax_box, ax_hist) = plt.subplots(
        2, sharex=True, gridspec_kw={"height_ratios": (0.2, 0.8)}, figsize=(12, 8)
    )
    # Boxplot
    sns.boxplot(x=data[feature], ax=ax_box)
    ax_box.set(xlabel='')
    # Histogram
    sns.histplot(data[feature], kde=True, ax=ax_hist)
    mean = data[feature].mean()
    median = data[feature].median()
    # Add mean and median lines with labels so the legend renders correctly
    ax_hist.axvline(mean, color='r', linestyle='--', linewidth=2, label=f'Mean: {mean:.2f}')
    ax_hist.axvline(median, color='g', linestyle='-', linewidth=2, label=f'Median: {median:.2f}')
    ax_hist.legend()
    # Labels
    ax_hist.set(title=f'{feature} Distribution', xlabel=feature, ylabel='Frequency')
    plt.show()
# Plotting cumsales
plot_histogram_boxplot(data, 'CUMSALES')
The distribution is heavily right-skewed with many outliers. Not good for the model.
# Plotting frequency
plot_histogram_boxplot(data, 'FREQUENCY')
Outliers present.
# Plotting average ticket
plot_histogram_boxplot(data, 'AVERAGE_TICKET')
Outliers present.
# Plotting recency
plot_histogram_boxplot(data, 'RECENCY')
RECENCY looks fine. Recall that it is the number of days since the last purchase, not months. The distribution is right-skewed.
# Plotting consistency
plot_histogram_boxplot(data, 'CONSISTENCY')
Most customers fall into the segment that visits only once or twice a year; only rarely did anyone visit for most of the year. Those rare cases act as outliers.
Imputation of NA in Age variable¶
# Plotting age
plot_histogram_boxplot(data, 'AGE')
There is very little chance that anyone below 15 years old places an order, so the outliers on the left of the distribution should definitely be dropped. There are also elderly customers, which is quite plausible given their proportion, so let's keep them.
# Calculate the first quartile (Q1) and third quartile (Q3)
Q1_age = data['AGE'].quantile(0.25)
Q3_age = data['AGE'].quantile(0.75)
# Calculate the Interquartile Range (IQR)
IQR_age = Q3_age - Q1_age
# Define the lower bound (to remove left-side outliers)
lower_bound_age = Q1_age - 1.5 * IQR_age
# Filter data to keep rows where AGE is above the lower bound, but keep NaN values
data = data[(data['AGE'] >= lower_bound_age) | (data['AGE'].isna())]
# Plotting age
plot_histogram_boxplot(data, 'AGE')
data.isnull().sum()
CUMSALES 0 FREQUENCY 0 AVERAGE_TICKET 0 RECENCY 0 CONSISTENCY 0 SEGMENT_1 0 LOYALTY_GROUP 0 PRICE_GROUP 0 SEGMENT_2 0 GENDER 1 MARITAL_STATUS 1 AGE 1158 MOSTUSED_PLATFORM 0 dtype: int64
Of course, as expected, variables like AGE tend to be right-skewed, so let's impute the median to fill the missing values (over 10% of rows) in the AGE variable. Also drop the rows with missing values in GENDER and MARITAL_STATUS, as there is just one of each.
# Impute the median for missing values in the 'AGE' column
median_age = data['AGE'].median()
data['AGE'] = data['AGE'].fillna(median_age)
# Drop rows where 'GENDER' or 'MARITAL_STATUS' is missing
data.dropna(subset=['GENDER', 'MARITAL_STATUS'], inplace=True)
# To verify if the missing values have been handled
print(data.isnull().sum())
CUMSALES 0 FREQUENCY 0 AVERAGE_TICKET 0 RECENCY 0 CONSISTENCY 0 SEGMENT_1 0 LOYALTY_GROUP 0 PRICE_GROUP 0 SEGMENT_2 0 GENDER 0 MARITAL_STATUS 0 AGE 0 MOSTUSED_PLATFORM 0 dtype: int64
data.info()
<class 'pandas.core.frame.DataFrame'> Index: 9454 entries, 0 to 9503 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUMSALES 9454 non-null float64 1 FREQUENCY 9454 non-null int64 2 AVERAGE_TICKET 9454 non-null float64 3 RECENCY 9454 non-null int64 4 CONSISTENCY 9454 non-null int64 5 SEGMENT_1 9454 non-null object 6 LOYALTY_GROUP 9454 non-null object 7 PRICE_GROUP 9454 non-null object 8 SEGMENT_2 9454 non-null object 9 GENDER 9454 non-null object 10 MARITAL_STATUS 9454 non-null object 11 AGE 9454 non-null float64 12 MOSTUSED_PLATFORM 9454 non-null object dtypes: float64(3), int64(3), object(7) memory usage: 1.0+ MB
Thus, the dataset is free of null values. Let's continue the EDA with the factor (categorical) variables.
# Defining a plotting function for categorical variables
def bar_plot(data, feature, figsize=(14, 6), order=None):
    """
    Bar plot for categorical variables.
    data: DataFrame
    feature: DataFrame column to plot
    figsize: size of figure (default (14, 6))
    order: order of categories (default None)
    """
    plt.figure(figsize=figsize)
    sns.countplot(data=data, x=feature, order=order, palette="viridis")
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.show()
#plotting segment_1
bar_plot(data, 'SEGMENT_1')
Our target variable SEGMENT_1 shows only mild class imbalance. A good sign.
#plotting loyalty_group
bar_plot(data, 'LOYALTY_GROUP')
Four classes in LOYALTY_GROUP is perfectly manageable.
#plotting price_group
bar_plot(data, 'PRICE_GROUP')
Each class represents its own distinct group; there is no need to merge any of them.
#plotting segment_2
bar_plot(data, 'SEGMENT_2')
There is a good proportion of observations in each of the 6 classes, so let's retain all of them and see how the model performs.
categorical_columns = ['GENDER', 'MARITAL_STATUS', 'MOSTUSED_PLATFORM']
# Plot categorical variables
for column in categorical_columns:
    bar_plot(data, column)
In MOSTUSED_PLATFORM, 'Mobile' and 'By Phone' refer to the same channel, so let's merge them.
# Replace 'By Phone' with 'Mobile'
data['MOSTUSED_PLATFORM'] = data['MOSTUSED_PLATFORM'].replace({'By Phone': 'Mobile'})
bar_plot(data, 'MOSTUSED_PLATFORM')
Outlier Removal¶
Time to remove outliers from all numerical columns except AGE, which was handled in previous steps. This helps the model learn the general patterns and reduces overfitting, because outlier cases (for example, customers with very high annual sales) are rare. It does add a small bias: the model will be less able to capture those outlier cases. That trade-off is acceptable, since their number is far smaller than the typical cases in the dataset.
def remove_outliers_all_but_age(data):
    """
    Removes outliers from all numerical columns in the DataFrame based on
    the IQR method, except the 'AGE' column.
    Args:
    - data: DataFrame containing the data.
    Returns:
    - A new DataFrame with outliers removed from all numerical columns except 'AGE'.
    """
    # Create a copy of the data to preserve the original DataFrame
    clean_data = data.copy()
    # Loop through all numerical columns except 'AGE'
    for feature in clean_data.select_dtypes(include=['float64', 'int64']).columns:
        if feature == 'AGE':
            continue  # Skip the 'AGE' column
        # Calculate Q1 (25th percentile) and Q3 (75th percentile)
        Q1 = clean_data[feature].quantile(0.25)
        Q3 = clean_data[feature].quantile(0.75)
        IQR = Q3 - Q1
        # Define bounds for the outliers
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Filter out the outliers
        clean_data = clean_data[(clean_data[feature] >= lower_bound) & (clean_data[feature] <= upper_bound)]
    return clean_data
# Apply the function to remove outliers from all numerical columns except 'AGE'
clean_data = remove_outliers_all_but_age(data)
# Check the resulting DataFrame after outlier removal
clean_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 7561 entries, 1 to 9503 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUMSALES 7561 non-null float64 1 FREQUENCY 7561 non-null int64 2 AVERAGE_TICKET 7561 non-null float64 3 RECENCY 7561 non-null int64 4 CONSISTENCY 7561 non-null int64 5 SEGMENT_1 7561 non-null object 6 LOYALTY_GROUP 7561 non-null object 7 PRICE_GROUP 7561 non-null object 8 SEGMENT_2 7561 non-null object 9 GENDER 7561 non-null object 10 MARITAL_STATUS 7561 non-null object 11 AGE 7561 non-null float64 12 MOSTUSED_PLATFORM 7561 non-null object dtypes: float64(3), int64(3), object(7) memory usage: 827.0+ KB
Visualizing the distributions after outlier removal:
numerical_columns = ['CUMSALES', 'FREQUENCY', 'AVERAGE_TICKET', 'RECENCY', 'CONSISTENCY']
# Plot numerical variables
for column in numerical_columns:
    plot_histogram_boxplot(clean_data, column)
Scaling Features¶
It is time to scale the data via standardization. Keeping features on similar scales improves consistency across models, especially for Logistic Regression, SVM, and Lasso, which are sensitive to feature magnitudes. Scaling before model fitting ensures that features with larger magnitudes do not disproportionately influence the model and skew the coefficient estimates. Before that, however, we should check whether it is better to bin AGE or keep it as a continuous variable, using bivariate plots to drive the decision.
# Set up the figure with 2 subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
# Plot AGE distribution for 'Core' class
sns.histplot(clean_data[clean_data['SEGMENT_1'] == 'Core']['AGE'], bins=20, kde=True, ax=axes[0], color='blue')
axes[0].set_title('AGE Distribution (Core)')
# Plot AGE distribution for 'Up' class
sns.histplot(clean_data[clean_data['SEGMENT_1'] == 'Up']['AGE'], bins=20, kde=True, ax=axes[1], color='orange')
axes[1].set_title('AGE Distribution (Up)')
# Show the plots
plt.tight_layout()
plt.show()
Based on the plots above, the age distributions of the 'Core' and 'Up' classes are quite similar, with peaks in the same range (30-40). The 'Up' class has a slightly higher peak but fewer total samples. Since there is no clear visual separation or distinct threshold between the age ranges of the two groups, binning is unlikely to add value. Keeping AGE continuous lets the model capture subtle differences that binning might oversimplify or overlook.
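The visual impression can be backed with a number: a two-sample Kolmogorov-Smirnov test quantifies how different the two AGE distributions are. The sketch below runs on synthetic stand-in data (not the actual AGE values), drawn from nearly identical distributions to mimic the histograms above:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic stand-ins for AGE in the 'Core' and 'Up' classes
age_core = rng.normal(35, 10, size=1000)
age_up = rng.normal(35, 10, size=800)

# A small KS statistic / large p-value means no evidence the distributions
# differ, which supports keeping AGE continuous instead of binning it
stat, p_value = ks_2samp(age_core, age_up)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```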
clean_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 7561 entries, 1 to 9503 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUMSALES 7561 non-null float64 1 FREQUENCY 7561 non-null int64 2 AVERAGE_TICKET 7561 non-null float64 3 RECENCY 7561 non-null int64 4 CONSISTENCY 7561 non-null int64 5 SEGMENT_1 7561 non-null object 6 LOYALTY_GROUP 7561 non-null object 7 PRICE_GROUP 7561 non-null object 8 SEGMENT_2 7561 non-null object 9 GENDER 7561 non-null object 10 MARITAL_STATUS 7561 non-null object 11 AGE 7561 non-null float64 12 MOSTUSED_PLATFORM 7561 non-null object dtypes: float64(3), int64(3), object(7) memory usage: 827.0+ KB
from sklearn.preprocessing import StandardScaler
# Create a copy of your cleaned data before scaling
scaled_data = clean_data.copy()
# List of numerical features to scale (excluding 'AGE' if needed)
num_features = ['CUMSALES', 'FREQUENCY', 'AVERAGE_TICKET', 'RECENCY', 'CONSISTENCY', 'AGE']
# Initialize the StandardScaler
scaler = StandardScaler()
# Apply the scaler to the numerical features
scaled_data[num_features] = scaler.fit_transform(clean_data[num_features])
# Check the scaled data (optional)
scaled_data[num_features].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CUMSALES | 7561.0 | 5.638482e-17 | 1.000066 | -0.995693 | -0.707614 | -0.346994 | 0.324629 | 4.344515 |
| FREQUENCY | 7561.0 | 6.014381e-17 | 1.000066 | -0.748322 | -0.748322 | -0.748322 | 0.655490 | 2.761207 |
| AVERAGE_TICKET | 7561.0 | 7.517976e-18 | 1.000066 | -1.543541 | -0.798491 | -0.189948 | 0.621031 | 3.122194 |
| RECENCY | 7561.0 | -9.303496e-17 | 1.000066 | -1.330001 | -0.878471 | -0.262749 | 0.845550 | 2.076995 |
| CONSISTENCY | 7561.0 | 6.014381e-17 | 1.000066 | -0.724761 | -0.724761 | -0.724761 | 0.786315 | 3.052928 |
| AGE | 7561.0 | 1.578775e-16 | 1.000066 | -2.080734 | -0.598641 | -0.110166 | 0.393158 | 6.855153 |
Now let's look at the bivariate relationships among the features; the target is already a factor variable.
scaled_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 7561 entries, 1 to 9503 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUMSALES 7561 non-null float64 1 FREQUENCY 7561 non-null float64 2 AVERAGE_TICKET 7561 non-null float64 3 RECENCY 7561 non-null float64 4 CONSISTENCY 7561 non-null float64 5 SEGMENT_1 7561 non-null object 6 LOYALTY_GROUP 7561 non-null object 7 PRICE_GROUP 7561 non-null object 8 SEGMENT_2 7561 non-null object 9 GENDER 7561 non-null object 10 MARITAL_STATUS 7561 non-null object 11 AGE 7561 non-null float64 12 MOSTUSED_PLATFORM 7561 non-null object dtypes: float64(6), object(7) memory usage: 827.0+ KB
Bivariate Visuals¶
# Pair plot with 'SEGMENT_1' as the hue (color-coded by target class)
sns.pairplot(scaled_data, hue='SEGMENT_1', diag_kind='kde', height=2)
plt.show()
Based on the pair plot, the relationships between most variables and the target class (SEGMENT_1) appear non-linear. For instance, CUMSALES, FREQUENCY, and RECENCY show dispersed, non-linear scatter rather than clear linear trends, and AVERAGE_TICKET and AGE form clusters without straight-line relationships with other variables. Non-linear models such as decision trees, random forests, or boosting techniques may therefore be more suitable for capturing these patterns. Still, let's build all the models and see whether this holds. A Pearson correlation matrix adds little here, since it only measures linear relationships.
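One way to quantify dependence without assuming linearity is mutual information. The sketch below uses synthetic data (an assumption, not the project dataset) with a deliberately non-monotonic rule, which Pearson correlation would score near zero but mutual information detects:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
# x1 drives the class through a non-monotonic rule; x2 is pure noise
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
y = (np.abs(x1) > 1.0).astype(int)

X = np.column_stack([x1, x2])
# Mutual information detects non-linear dependence: x1 should score
# well above the noise feature x2
mi = mutual_info_classif(X, y, random_state=0)
print(mi)
```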
Correlations¶
from scipy.stats import chi2_contingency
# Assuming 'scaled_data' is your DataFrame
categorical_columns = ['LOYALTY_GROUP', 'PRICE_GROUP', 'SEGMENT_2', 'GENDER', 'MARITAL_STATUS', 'MOSTUSED_PLATFORM']
target = 'SEGMENT_1'
# Loop through each categorical variable
for col in categorical_columns:
    # Create a contingency table
    contingency_table = pd.crosstab(scaled_data[col], scaled_data[target])
    # Perform the Chi-Square test
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    # Print the results
    print(f'Chi-Square Test for {col} vs {target}:')
    print(f'Chi2 Statistic = {chi2}, p-value = {p_value}\n')
Chi-Square Test for LOYALTY_GROUP vs SEGMENT_1: Chi2 Statistic = 278.4182793540704, p-value = 4.6567110119511346e-60 Chi-Square Test for PRICE_GROUP vs SEGMENT_1: Chi2 Statistic = 11.542819556187531, p-value = 0.02109539573677649 Chi-Square Test for SEGMENT_2 vs SEGMENT_1: Chi2 Statistic = 7561.0, p-value = 0.0 Chi-Square Test for GENDER vs SEGMENT_1: Chi2 Statistic = 1.8265144110019862, p-value = 0.17653980126228425 Chi-Square Test for MARITAL_STATUS vs SEGMENT_1: Chi2 Statistic = 1.9754552675855324, p-value = 0.37242200941084663 Chi-Square Test for MOSTUSED_PLATFORM vs SEGMENT_1: Chi2 Statistic = 57.912472538128725, p-value = 2.740440184863273e-14
The Chi-Square tests reveal that LOYALTY_GROUP, PRICE_GROUP, SEGMENT_2, and MOSTUSED_PLATFORM have a significant association with the target variable SEGMENT_1 (p-value < 0.05), indicating these features are likely important for distinguishing between segments. Conversely, GENDER and MARITAL_STATUS show no significant association (p-value > 0.05), suggesting they may not be valuable predictors on their own; however, they may still contribute through interactions with other features, so let's keep them (there are only two of them). Given that SEGMENT_2 is perfectly associated with SEGMENT_1 (its Chi2 statistic equals the sample size, so it likely encodes the same information), it should be removed to avoid redundancy and target leakage.
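p-values alone do not convey effect size. Cramér's V, derived from the same chi-square statistic, does: it ranges from 0 (independence) to 1 (perfect association), and for the SEGMENT_2 / SEGMENT_1 pair above, Chi2 = n implies V = 1. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V effect size (0 to 1) for two categorical variables."""
    table = pd.crosstab(x, y)
    # correction=False so a perfectly associated 2x2 table yields V = 1
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy example: a perfectly associated pair -> V = 1.0
a = pd.Series(["x", "x", "y", "y", "x", "y"])
b = pd.Series(["p", "p", "q", "q", "p", "q"])
print(cramers_v(a, b))  # 1.0
```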
scaled_data = scaled_data.drop(columns=['SEGMENT_2'])
scaled_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 7561 entries, 1 to 9503 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUMSALES 7561 non-null float64 1 FREQUENCY 7561 non-null float64 2 AVERAGE_TICKET 7561 non-null float64 3 RECENCY 7561 non-null float64 4 CONSISTENCY 7561 non-null float64 5 SEGMENT_1 7561 non-null object 6 LOYALTY_GROUP 7561 non-null object 7 PRICE_GROUP 7561 non-null object 8 GENDER 7561 non-null object 9 MARITAL_STATUS 7561 non-null object 10 AGE 7561 non-null float64 11 MOSTUSED_PLATFORM 7561 non-null object dtypes: float64(6), object(6) memory usage: 767.9+ KB
One-hot Encoding¶
One-hot encoding is necessary to convert categorical variables into a numeric format that models can process. For linear models (Logistic Regression, Lasso), drop_first=True avoids multicollinearity by removing one category. For non-linear models (Random Forest, XGBoost, SVM), drop_first=False is used to retain all categories, as multicollinearity is not an issue.
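The two encoding variants described above can be produced directly with `pd.get_dummies`. A minimal sketch on a made-up column (the real notebook uses `drop_first=False` below; the `drop_first=True` variant is shown for comparison):

```python
import pandas as pd

demo = pd.DataFrame({"MARITAL_STATUS": ["Married", "Single", "Divorced", "Married"]})

# Linear models: drop one level per variable to avoid the dummy-variable trap
linear_enc = pd.get_dummies(demo, columns=["MARITAL_STATUS"], drop_first=True, dtype=int)

# Tree-based models: keep every level
tree_enc = pd.get_dummies(demo, columns=["MARITAL_STATUS"], drop_first=False, dtype=int)

print(linear_enc.columns.tolist())  # 2 dummy columns (first level dropped)
print(tree_enc.columns.tolist())    # 3 dummy columns
```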
# Label encode the target variable 'SEGMENT_1'
le = LabelEncoder()
scaled_data['SEGMENT_1'] = le.fit_transform(scaled_data['SEGMENT_1'])
# List of categorical columns to one-hot encode (other features)
oneHotCols = ['LOYALTY_GROUP', 'PRICE_GROUP', 'GENDER', 'MARITAL_STATUS', 'MOSTUSED_PLATFORM']
# One-hot encode the categorical features and replace True/False with 1/0
model_data = pd.get_dummies(scaled_data, columns=oneHotCols).replace({True: 1, False: 0})
# Print the class assignment for SEGMENT_1
print(f"Class mapping for 'SEGMENT_1': {le.classes_[0]} -> 0, {le.classes_[1]} -> 1")
Class mapping for 'SEGMENT_1': Core -> 0, Up -> 1
#check linear model data
model_data.head()
| CUMSALES | FREQUENCY | AVERAGE_TICKET | RECENCY | CONSISTENCY | SEGMENT_1 | AGE | LOYALTY_GROUP_Loyal | LOYALTY_GROUP_Ocasional | LOYALTY_GROUP_Split | LOYALTY_GROUP_Vip | PRICE_GROUP_Moderately Price Insensitive | PRICE_GROUP_Moderately Price Sensitive | PRICE_GROUP_Selective Price Sensitive | PRICE_GROUP_Very Price Insensitive | PRICE_GROUP_Very Price Sensitive | GENDER_Female | GENDER_Male | MARITAL_STATUS_Divorced | MARITAL_STATUS_Married | MARITAL_STATUS_Single | MOSTUSED_PLATFORM_Mobile | MOSTUSED_PLATFORM_Web | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.541602 | -0.748322 | -0.009761 | 0.681358 | -0.724761 | 0 | -0.632580 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | -0.162795 | -0.046416 | -0.139138 | 1.707561 | -0.724761 | 0 | 0.614972 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 1.400982 | 0.655490 | 1.151894 | -0.344846 | 0.030777 | 0 | -0.529552 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 5 | 0.494357 | 2.059301 | -0.540529 | -0.960568 | 2.297391 | 0 | 2.516753 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 6 | -0.801804 | -0.748322 | -0.888646 | -1.083712 | -0.724761 | 0 | 2.736446 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
Class Imbalance in Target¶
Let's check the class imbalance in the target variable one last time as we near model fitting.
# Class distribution of the target as percentages
class_counts = model_data['SEGMENT_1'].value_counts(normalize=True) * 100
# Plot the class distribution
plt.figure(figsize=(8, 6))
colors = sns.color_palette("Set2")[:2] # Limit the palette to 2 colors
ax = sns.barplot(x=class_counts.index, y=class_counts.values, palette=colors)
# Add percentage labels inside each bar, adjusting placement
for p in ax.patches:
    ax.annotate(f'{p.get_height():.2f}%',
                (p.get_x() + p.get_width() / 2., p.get_height() - 5),  # Position the label inside the bar
                ha='center', va='center', color='black', fontsize=12)
# Title and labels
plt.title('Class Distribution of SEGMENT_1')
plt.ylabel('Percentage')
plt.xlabel('Classes')
plt.ylim(0, 100) # Set y-axis limit to ensure there's enough room for labels
plt.show()
A 60/40 split is an acceptable class balance in the target variable (SEGMENT_1), so no resampling is needed. We will maintain this slight imbalance across the train and test splits in the upcoming steps. The classes are: 0 = 'Core', 1 = 'Up'.
Stratified Train-Test Split¶
Time to split the data into training and test sets. With a good amount of data available (about 7.5K observations in total), I hold out 20% for testing and keep the remaining 80% for training.
model_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 7561 entries, 1 to 9503 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUMSALES 7561 non-null float64 1 FREQUENCY 7561 non-null float64 2 AVERAGE_TICKET 7561 non-null float64 3 RECENCY 7561 non-null float64 4 CONSISTENCY 7561 non-null float64 5 SEGMENT_1 7561 non-null int32 6 AGE 7561 non-null float64 7 LOYALTY_GROUP_Loyal 7561 non-null int64 8 LOYALTY_GROUP_Ocasional 7561 non-null int64 9 LOYALTY_GROUP_Split 7561 non-null int64 10 LOYALTY_GROUP_Vip 7561 non-null int64 11 PRICE_GROUP_Moderately Price Insensitive 7561 non-null int64 12 PRICE_GROUP_Moderately Price Sensitive 7561 non-null int64 13 PRICE_GROUP_Selective Price Sensitive 7561 non-null int64 14 PRICE_GROUP_Very Price Insensitive 7561 non-null int64 15 PRICE_GROUP_Very Price Sensitive 7561 non-null int64 16 GENDER_Female 7561 non-null int64 17 GENDER_Male 7561 non-null int64 18 MARITAL_STATUS_Divorced 7561 non-null int64 19 MARITAL_STATUS_Married 7561 non-null int64 20 MARITAL_STATUS_Single 7561 non-null int64 21 MOSTUSED_PLATFORM_Mobile 7561 non-null int64 22 MOSTUSED_PLATFORM_Web 7561 non-null int64 dtypes: float64(6), int32(1), int64(16) memory usage: 1.4 MB
from sklearn.model_selection import train_test_split
# Linear model dataset
x = model_data.drop('SEGMENT_1', axis=1) # Predictor columns
y = model_data['SEGMENT_1'] # Target variable
# Train-test split with stratification
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=1)
Stratification was applied during the initial train-test split to ensure that the class distribution in the target variable (SEGMENT_1) is proportionally represented in both the training and test sets. This is crucial for imbalanced classification problems, as it avoids bias in model evaluation and ensures that the test set reflects the real-world distribution of the target classes.
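The effect of stratification can be verified directly by comparing class proportions across the two splits. Here is a minimal sketch with synthetic labels (not the notebook's actual SEGMENT_1 data), showing that `stratify=y` keeps the class ratio nearly identical in train and test:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic binary target with a 40/60 imbalance, mirroring the real data
rng = np.random.default_rng(1)
y_demo = pd.Series(rng.choice([0, 1], size=1000, p=[0.4, 0.6]))
X_demo = pd.DataFrame({"f": rng.normal(size=1000)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

# With stratification, the class-1 proportions agree to within a fraction of a percent
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without `stratify=y`, the test-set proportion can drift by a few percentage points on any given random split, which would bias the evaluation metrics.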
PCA¶
PCA will not be incorporated into the model; this section is purely exploratory, to see how the data behaves under dimensionality reduction.
model_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 7561 entries, 1 to 9503 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUMSALES 7561 non-null float64 1 FREQUENCY 7561 non-null float64 2 AVERAGE_TICKET 7561 non-null float64 3 RECENCY 7561 non-null float64 4 CONSISTENCY 7561 non-null float64 5 SEGMENT_1 7561 non-null int32 6 AGE 7561 non-null float64 7 LOYALTY_GROUP_Loyal 7561 non-null int64 8 LOYALTY_GROUP_Ocasional 7561 non-null int64 9 LOYALTY_GROUP_Split 7561 non-null int64 10 LOYALTY_GROUP_Vip 7561 non-null int64 11 PRICE_GROUP_Moderately Price Insensitive 7561 non-null int64 12 PRICE_GROUP_Moderately Price Sensitive 7561 non-null int64 13 PRICE_GROUP_Selective Price Sensitive 7561 non-null int64 14 PRICE_GROUP_Very Price Insensitive 7561 non-null int64 15 PRICE_GROUP_Very Price Sensitive 7561 non-null int64 16 GENDER_Female 7561 non-null int64 17 GENDER_Male 7561 non-null int64 18 MARITAL_STATUS_Divorced 7561 non-null int64 19 MARITAL_STATUS_Married 7561 non-null int64 20 MARITAL_STATUS_Single 7561 non-null int64 21 MOSTUSED_PLATFORM_Mobile 7561 non-null int64 22 MOSTUSED_PLATFORM_Web 7561 non-null int64 dtypes: float64(6), int32(1), int64(16) memory usage: 1.4 MB
# List of numerical columns (float64 type) from dataset
columns_to_include = ['CUMSALES', 'FREQUENCY', 'AVERAGE_TICKET', 'RECENCY', 'CONSISTENCY', 'AGE']
# Creating the new DataFrame 'num_data' with only the selected numerical columns
num_data = model_data[columns_to_include].copy()
# Display the new DataFrame info to verify
num_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 7561 entries, 1 to 9503 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUMSALES 7561 non-null float64 1 FREQUENCY 7561 non-null float64 2 AVERAGE_TICKET 7561 non-null float64 3 RECENCY 7561 non-null float64 4 CONSISTENCY 7561 non-null float64 5 AGE 7561 non-null float64 dtypes: float64(6) memory usage: 413.5 KB
num_data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CUMSALES | 7561.0 | 5.638482e-17 | 1.000066 | -0.995693 | -0.707614 | -0.346994 | 0.324629 | 4.344515 |
| FREQUENCY | 7561.0 | 6.014381e-17 | 1.000066 | -0.748322 | -0.748322 | -0.748322 | 0.655490 | 2.761207 |
| AVERAGE_TICKET | 7561.0 | 7.517976e-18 | 1.000066 | -1.543541 | -0.798491 | -0.189948 | 0.621031 | 3.122194 |
| RECENCY | 7561.0 | -9.303496e-17 | 1.000066 | -1.330001 | -0.878471 | -0.262749 | 0.845550 | 2.076995 |
| CONSISTENCY | 7561.0 | 6.014381e-17 | 1.000066 | -0.724761 | -0.724761 | -0.724761 | 0.786315 | 3.052928 |
| AGE | 7561.0 | 1.578775e-16 | 1.000066 | -2.080734 | -0.598641 | -0.110166 | 0.393158 | 6.855153 |
All numerical features are already in scaled form.
from sklearn.decomposition import PCA
import plotly.express as px
# Assuming 'num_data' is your scaled numerical dataset
pca = PCA()
pca.fit(num_data)
# Get the explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
# Create a cumulative explained variance
exp_var_cumul = np.cumsum(explained_variance_ratio)
# Create a DataFrame for a table of explained variance
df_explained_variance = pd.DataFrame({
    "Component": range(1, len(explained_variance_ratio) + 1),
    "Explained Variance Ratio": explained_variance_ratio,
    "Cumulative Explained Variance Ratio": exp_var_cumul
})
# Display the DataFrame
print(df_explained_variance)
# Plot the cumulative explained variance using plotly
px.area(
x=range(1, exp_var_cumul.shape[0] + 1),
y=exp_var_cumul,
labels={"x": "# Components", "y": "Cumulative Explained Variance"},
title="Cumulative Explained Variance by PCA Components"
).show()
Component Explained Variance Ratio Cumulative Explained Variance Ratio 0 1 0.507246 0.507246 1 2 0.200068 0.707313 2 3 0.154415 0.861728 3 4 0.118946 0.980674 4 5 0.013588 0.994262 5 6 0.005738 1.000000
How do these new dimensions relate to our original 6 variables? Let’s examine how the original variables project onto the first two components.
# List of your numerical features
features = ['CUMSALES', 'FREQUENCY', 'AVERAGE_TICKET', 'RECENCY', 'CONSISTENCY', 'AGE']
# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
components = pca.fit_transform(num_data)
# Get the loadings (correlation of each feature with the principal components)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
# Create scatter plot of the first two principal components
fig = px.scatter(x=components[:, 0], y=components[:, 1], labels={'x': 'PC1', 'y': 'PC2'}, title="PCA - First Two Components", opacity= 0.5)
# Annotate the plot with feature names
for i, feature in enumerate(features):
    fig.add_annotation(
        ax=0, ay=0,
        axref="x", ayref="y",
        x=loadings[i, 0],
        y=loadings[i, 1],
        showarrow=True,
        arrowsize=2,
        arrowhead=2,
        xanchor="right",
        yanchor="top"
    )
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=feature,
        yshift=5,
    )
# Show the final plot
fig.show()
Four components explain most (about 98%) of the total variance. Dropping the remaining two components is reasonable, as together they account for less than 2% of the total variance, so we can reduce dimensionality at little cost. We therefore apply PCA with 4 components.
# Setting the number of components to 4
pca_4 = PCA(n_components=4)
components_4 = pca_4.fit_transform(num_data)
# Calculate the total variance explained by the "solution" of 4 components
total_var = pca_4.explained_variance_ratio_.sum() * 100
# Plotting a 3D scatter plot using the first 3 principal components
fig = px.scatter_3d(
    components_4, x=0, y=1, z=2,
    opacity=0.5,
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}
)
fig.show()
# Create a new DataFrame with appropriate column names for the 4 principal components
data_pc_4 = pd.DataFrame(components_4, columns=['PC1', 'PC2', 'PC3', 'PC4'])
# Display the DataFrame with the principal components
data_pc_4
| PC1 | PC2 | PC3 | PC4 | |
|---|---|---|---|---|
| 0 | -1.323349 | 0.091103 | -0.687104 | 0.105848 |
| 1 | -1.028868 | 0.844333 | 0.217720 | 1.344431 |
| 2 | 1.453591 | 0.642680 | -1.057278 | -0.332180 |
| 3 | 2.902479 | -0.313900 | 2.806625 | 0.714700 |
| 4 | -0.895076 | 0.513984 | 3.002227 | -1.024102 |
| ... | ... | ... | ... | ... |
| 7556 | -0.976089 | 0.005727 | 2.789012 | -1.162678 |
| 7557 | -1.016056 | -0.724264 | 1.122888 | -1.321492 |
| 7558 | -0.145599 | 0.551135 | -0.995906 | -1.954762 |
| 7559 | -1.067307 | -1.034568 | 0.629896 | -1.350027 |
| 7560 | -0.686861 | -0.365388 | -0.002668 | -1.595536 |
7561 rows × 4 columns
# Getting the names of the original features
feature_names = num_data.columns # Use your numerical feature dataset
# Create a DataFrame to store the weights of each variable for each component
component_weights_df = pd.DataFrame(
    pca_4.components_,  # Adjusted for 4 components
    columns=feature_names,
    index=[f"Component {i+1}" for i in range(4)]  # 4 components
)
# Display the DataFrame with the component weights
component_weights_df
| CUMSALES | FREQUENCY | AVERAGE_TICKET | RECENCY | CONSISTENCY | AGE | |
|---|---|---|---|---|---|---|
| Component 1 | 0.525089 | 0.526961 | 0.244727 | -0.316388 | 0.531298 | 0.065759 |
| Component 2 | 0.247993 | -0.236718 | 0.711250 | 0.300925 | -0.219109 | 0.487877 |
| Component 3 | -0.188231 | 0.077358 | -0.397514 | -0.207317 | 0.061491 | 0.868220 |
| Component 4 | 0.071457 | 0.289258 | -0.256798 | 0.875319 | 0.274380 | 0.061723 |
Since selecting 4 principal components only drops 2, the dimensionality reduction offered by PCA may not be substantial enough to justify the loss of interpretability when evaluating model results. However, because our primary aim is to build the best predictive classification model rather than to interpret individual feature contributions, PCA remains a reasonable choice: it reduces dimensionality and may improve model performance.
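If PCA were folded into the modelling workflow, the standard pattern is to place it inside a scikit-learn `Pipeline` so the components are fit on the training folds only, avoiding leakage during cross-validation. A minimal sketch with synthetic data (not the notebook's actual features):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 6 numeric features, binary target driven by the first feature
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

pipe = Pipeline([
    ("pca", PCA(n_components=4)),                        # keep 4 components, as above
    ("clf", LogisticRegression(random_state=1)),
])
# PCA is re-fit inside each CV fold, so the test fold never leaks into the components
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Fitting PCA on the full dataset before splitting, by contrast, lets test-set information influence the components and can slightly inflate CV estimates.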
3- Building Predictive Models¶
Key Note¶
As we proceed with model building and tune hyperparameters to make each model as strong as possible, we use K-fold cross-validation in two places for every model: during hyperparameter tuning and for the final model evaluation. Due to computational limits and time constraints, CV is not used to evaluate the base model in each fit, but only afterwards, to find optimal parameters during tuning and to evaluate the final model. To keep all performance metrics comparable across models, random_state=1 is set everywhere.
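The tuning half of that pattern can be sketched as follows: a `StratifiedKFold` with a fixed seed feeds `GridSearchCV`, so every model is scored on the same folds. This uses synthetic data and an illustrative parameter grid, not the notebook's actual features:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)

# Fixed random_state keeps the folds identical across every model compared
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=1),
    param_grid={"C": [0.01, 0.1, 1, 10]},   # illustrative grid
    cv=skf, scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Only the final chosen model is then refit on the full training set and evaluated, keeping the test set untouched during tuning.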
#Import necessary libraries for model building and evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
3.1 Parametric Models¶
3.1.1 Logistic Regression¶
x.head()
| CUMSALES | FREQUENCY | AVERAGE_TICKET | RECENCY | CONSISTENCY | AGE | LOYALTY_GROUP_Loyal | LOYALTY_GROUP_Ocasional | LOYALTY_GROUP_Split | LOYALTY_GROUP_Vip | PRICE_GROUP_Moderately Price Insensitive | PRICE_GROUP_Moderately Price Sensitive | PRICE_GROUP_Selective Price Sensitive | PRICE_GROUP_Very Price Insensitive | PRICE_GROUP_Very Price Sensitive | GENDER_Female | GENDER_Male | MARITAL_STATUS_Divorced | MARITAL_STATUS_Married | MARITAL_STATUS_Single | MOSTUSED_PLATFORM_Mobile | MOSTUSED_PLATFORM_Web | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.541602 | -0.748322 | -0.009761 | 0.681358 | -0.724761 | -0.632580 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | -0.162795 | -0.046416 | -0.139138 | 1.707561 | -0.724761 | 0.614972 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 1.400982 | 0.655490 | 1.151894 | -0.344846 | 0.030777 | -0.529552 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 5 | 0.494357 | 2.059301 | -0.540529 | -0.960568 | 2.297391 | 2.516753 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 6 | -0.801804 | -0.748322 | -0.888646 | -1.083712 | -0.724761 | 2.736446 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
x.info()
<class 'pandas.core.frame.DataFrame'> Index: 7561 entries, 1 to 9503 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CUMSALES 7561 non-null float64 1 FREQUENCY 7561 non-null float64 2 AVERAGE_TICKET 7561 non-null float64 3 RECENCY 7561 non-null float64 4 CONSISTENCY 7561 non-null float64 5 AGE 7561 non-null float64 6 LOYALTY_GROUP_Loyal 7561 non-null int64 7 LOYALTY_GROUP_Ocasional 7561 non-null int64 8 LOYALTY_GROUP_Split 7561 non-null int64 9 LOYALTY_GROUP_Vip 7561 non-null int64 10 PRICE_GROUP_Moderately Price Insensitive 7561 non-null int64 11 PRICE_GROUP_Moderately Price Sensitive 7561 non-null int64 12 PRICE_GROUP_Selective Price Sensitive 7561 non-null int64 13 PRICE_GROUP_Very Price Insensitive 7561 non-null int64 14 PRICE_GROUP_Very Price Sensitive 7561 non-null int64 15 GENDER_Female 7561 non-null int64 16 GENDER_Male 7561 non-null int64 17 MARITAL_STATUS_Divorced 7561 non-null int64 18 MARITAL_STATUS_Married 7561 non-null int64 19 MARITAL_STATUS_Single 7561 non-null int64 20 MOSTUSED_PLATFORM_Mobile 7561 non-null int64 21 MOSTUSED_PLATFORM_Web 7561 non-null int64 dtypes: float64(6), int64(16) memory usage: 1.3 MB
y
1 0
2 0
3 0
5 0
6 0
..
9499 1
9500 1
9501 1
9502 1
9503 1
Name: SEGMENT_1, Length: 7561, dtype: int32
# Let's check the split of the data
print("{0:0.2f}% data is in training set".format((len(x_train)/len(model_data.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(model_data.index)) * 100))
79.99% data is in training set 20.01% data is in test set
Baseline Model Fit¶
# Initialize the Logistic Regression model with the solver and random_state
logreg_model = LogisticRegression(solver="liblinear", random_state=1)
# Fit the model on the training data
logreg_model.fit(x_train, y_train)
# Predict on the test set
y_predict_logreg = logreg_model.predict(x_test)
Baseline Model Results¶
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
# Define a function to compute different metrics for classification models with model-specific names
def model_performance_classification_sklearn_with_threshold(model_name, model, predictors, target, threshold=0.5):
    """
    Function to compute different metrics, based on the threshold specified, to check classification model performance.
    model_name: string, name of the model for identification
    model: classifier model
    predictors: independent variables (features)
    target: dependent variable (target)
    threshold: threshold for classifying the observation as class 1
    """
    # Check if the model has a predict_proba method
    if hasattr(model, "predict_proba"):
        pred_prob = model.predict_proba(predictors)[:, 1]
        pred = np.where(pred_prob > threshold, 1, 0)
    else:
        # For models without predict_proba, use predict directly
        pred = model.predict(predictors)
    # Calculate metrics
    acc = accuracy_score(target, pred)  # Accuracy
    recall = recall_score(target, pred)  # Recall
    precision = precision_score(target, pred, zero_division=0)  # Precision
    f1 = f1_score(target, pred)  # F1 Score
    # Creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Model": [model_name],
            "Accuracy": [acc],
            "Recall": [recall],
            "Precision": [precision],
            "F1 Score": [f1]
        }
    )
    return df_perf
from sklearn.metrics import confusion_matrix
def confusion_matrix_with_counts_and_percentage(model, predictors, target, threshold=0.5):
    """
    Function to compute and plot the confusion matrix for a classification model with both counts and percentages.
    model: classifier
    predictors: independent variables (features)
    target: dependent variable (actual labels)
    threshold: threshold for classifying the observation as class 1
    """
    # Check if the model has a predict_proba method
    if hasattr(model, "predict_proba"):
        pred_prob = model.predict_proba(predictors)[:, 1]
        pred = np.where(pred_prob > threshold, model.classes_[1], model.classes_[0])
    else:
        # For models without predict_proba, use predict directly
        pred = model.predict(predictors)
    # Compute confusion matrix
    cm = confusion_matrix(target, pred, labels=model.classes_)
    # Compute row-wise percentages
    cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
    # Create an annotation matrix with counts and percentages
    annot = np.empty_like(cm).astype(str)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            annot[i, j] = f'{cm[i, j]}\n{cm_percent[i, j]:.2f}%'
    # Plotting the confusion matrix with annotations for both counts and percentages
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=annot, fmt='', cmap='RdPu', cbar=False,
                xticklabels=model.classes_, yticklabels=model.classes_)
    plt.title('Confusion Matrix with Counts and Percentages')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    return cm
# Print the class assignment for SEGMENT_1
print(f"Class mapping for 'SEGMENT_1': {le.classes_[0]} -> 0, {le.classes_[1]} -> 1")
# Confusion matrix for the training set
confusion_matrix_with_counts_and_percentage(logreg_model, x_train, y_train)
Class mapping for 'SEGMENT_1': Core -> 0, Up -> 1
array([[ 564, 1911],
[ 420, 3153]], dtype=int64)
# Now we calculate measures of fit for the training set
log_reg_model_train_perf = model_performance_classification_sklearn_with_threshold('Logistic Regression',logreg_model, x_train, y_train)
# Calculating performance in the test set
log_reg_model_test_perf = model_performance_classification_sklearn_with_threshold('Logistic Regression',logreg_model, x_test, y_test)
# Combine both into a single table for comparison
log_reg_combined_perf = pd.concat([log_reg_model_train_perf, log_reg_model_test_perf], axis=0)
log_reg_combined_perf.index = ['Train', 'Test'] # Set index labels for clarity
# Display the combined performance table
print("Logistic Regression Training and Test Performance:")
log_reg_combined_perf.T
Logistic Regression Training and Test Performance:
| Train | Test | |
|---|---|---|
| Model | Logistic Regression | Logistic Regression |
| Accuracy | 0.614583 | 0.606742 |
| Recall | 0.882452 | 0.873602 |
| Precision | 0.62263 | 0.618369 |
| F1 Score | 0.730115 | 0.724154 |
Display the coefficients alongside their corresponding variables.
# Create a DataFrame with coefficients and feature names
coef_df_logreg = pd.DataFrame(logreg_model.coef_.T, index=x_train.columns, columns=['Coefficient'])
# Add the intercept to the DataFrame
coef_df_logreg.loc['Intercept'] = logreg_model.intercept_
# Sort coefficients in descending order
coef_df_logreg = coef_df_logreg.sort_values(by='Coefficient', ascending=False)
# Display the coefficients DataFrame
print("Coefficients and Intercept:")
coef_df_logreg
Coefficients and Intercept:
| Coefficient | |
|---|---|
| LOYALTY_GROUP_Vip | 2.099112 |
| LOYALTY_GROUP_Loyal | 1.072741 |
| Intercept | 0.509945 |
| MOSTUSED_PLATFORM_Web | 0.444080 |
| AVERAGE_TICKET | 0.381331 |
| GENDER_Female | 0.324417 |
| PRICE_GROUP_Moderately Price Insensitive | 0.217088 |
| MARITAL_STATUS_Single | 0.197202 |
| MARITAL_STATUS_Divorced | 0.191661 |
| PRICE_GROUP_Selective Price Sensitive | 0.188782 |
| GENDER_Male | 0.185528 |
| FREQUENCY | 0.181467 |
| MARITAL_STATUS_Married | 0.121082 |
| MOSTUSED_PLATFORM_Mobile | 0.065866 |
| PRICE_GROUP_Very Price Sensitive | 0.064464 |
| PRICE_GROUP_Moderately Price Sensitive | 0.038113 |
| AGE | 0.027825 |
| PRICE_GROUP_Very Price Insensitive | 0.001499 |
| RECENCY | -0.026280 |
| CONSISTENCY | -0.055085 |
| CUMSALES | -0.849724 |
| LOYALTY_GROUP_Ocasional | -1.172957 |
| LOYALTY_GROUP_Split | -1.488951 |
Classification Metrics Using K-fold CV¶
With a slight class imbalance in the target variable SEGMENT_1 (60/40), stratified K-fold does a good job of maintaining these proportions in each split.
from sklearn.model_selection import cross_val_predict, StratifiedKFold
# Deploying Stratified K-fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
# Get cross-validated predictions for the entire dataset
y_pred_cv_logreg = cross_val_predict(logreg_model, x, y, cv=skf)
# Calculate metrics using the cross-validated predictions
accuracy_cv_logreg = accuracy_score(y, y_pred_cv_logreg)
precision_cv_logreg = precision_score(y, y_pred_cv_logreg)
recall_cv_logreg = recall_score(y, y_pred_cv_logreg)
f1_cv_logreg = f1_score(y, y_pred_cv_logreg)
# Creating a summary table for the CV results specific to Logistic Regression
logreg_cv_metrics = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Cross-Validated Score': [accuracy_cv_logreg, precision_cv_logreg, recall_cv_logreg, f1_cv_logreg]
})
# Display the metrics summary after cross-validation
print("Cross-Validation Performance for Logistic Regression:")
logreg_cv_metrics
Cross-Validation Performance for Logistic Regression:
| Metric | Cross-Validated Score | |
|---|---|---|
| 0 | Accuracy | 0.613675 |
| 1 | Precision | 0.621503 |
| 2 | Recall | 0.885158 |
| 3 | F1 Score | 0.730261 |
Confusion Matrix¶
#creating function that charts confusion matrix for CV results
def confusion_matrix_with_cv_predictions(y_true, y_pred, labels):
    """
    Function to compute and plot the confusion matrix with both counts and percentages using precomputed predictions.
    y_true: actual labels
    y_pred: predicted labels
    labels: model classes or label names
    """
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    # Compute row-wise percentages
    cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
    # Create an annotation matrix with counts and percentages
    annot = np.empty_like(cm).astype(str)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            annot[i, j] = f'{cm[i, j]}\n{cm_percent[i, j]:.2f}%'
    # Plot the confusion matrix
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=annot, fmt='', cmap='Blues', cbar=False, xticklabels=labels, yticklabels=labels)
    plt.title('Confusion Matrix with Counts and Percentages')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    return cm  # Return the confusion matrix for further analysis if needed
# Print the class assignment for SEGMENT_1
print(f"Class mapping for 'SEGMENT_1': {le.classes_[0]} -> 0, {le.classes_[1]} -> 1")
# Plot confusion matrix for CV predictions
confusion_matrix_with_cv_predictions(y, y_pred_cv_logreg, labels=logreg_model.classes_)
Class mapping for 'SEGMENT_1': Core -> 0, Up -> 1
array([[ 686, 2408],
[ 513, 3954]], dtype=int64)
3.1.2 Lasso¶
Since the hyperparameters must be tuned anyway, we skip the less useful intermediate steps and jump straight to finding the optimal value of C (the inverse of alpha, as parameterized in scikit-learn's logistic regression) to build the best Lasso model directly.
Hyperparameter Tuning Using K-fold CV¶
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.exceptions import UndefinedMetricWarning
import warnings
# Suppress UndefinedMetricWarning
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
# Define a range of C values (inverse of alpha in Logistic Regression)
C_values = np.logspace(-4, 4, 50) # Similar to alpha, where C is 1/alpha
# Initialize variables to store the best scores and corresponding C
best_accuracy = 0
best_C = None
# Lists to store metrics for each C value
accuracy_values = []
precision_values = []
recall_values = []
f1_values = []
# Perform cross-validation with StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# Iterate over the C values (regularization strengths)
for C in C_values:
    logreg_lasso = LogisticRegression(penalty='l1', solver='liblinear', C=C, random_state=1)  # Logistic Regression with L1 regularization
    # Cross-validate the model on each metric
    cv_accuracy_scores = cross_val_score(logreg_lasso, x_train, y_train, cv=skf, scoring='accuracy')
    cv_precision_scores = cross_val_score(logreg_lasso, x_train, y_train, cv=skf, scoring='precision')
    cv_recall_scores = cross_val_score(logreg_lasso, x_train, y_train, cv=skf, scoring='recall')
    cv_f1_scores = cross_val_score(logreg_lasso, x_train, y_train, cv=skf, scoring='f1')
    # Calculate the average cross-validation score for each metric
    accuracy = np.mean(cv_accuracy_scores)
    precision = np.mean(cv_precision_scores)
    recall = np.mean(cv_recall_scores)
    f1 = np.mean(cv_f1_scores)
    # Store metrics for plotting
    accuracy_values.append(accuracy)
    precision_values.append(precision)
    recall_values.append(recall)
    f1_values.append(f1)
    # Update the best score and C value based on accuracy
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_C = C
# Print the best C value and corresponding accuracy
print(f"Best C for Logistic Regression with Lasso: {best_C:.4f} with Accuracy: {best_accuracy:.4f}")
# Plot the metrics against C values
plt.figure(figsize=(12, 8))
# Accuracy plot
plt.subplot(2, 3, 1)
plt.plot(C_values, accuracy_values, label='Accuracy', color='blue')
plt.xscale('log')
plt.xlabel('C (1/alpha)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs C')
# Precision plot
plt.subplot(2, 3, 2)
plt.plot(C_values, precision_values, label='Precision', color='green')
plt.xscale('log')
plt.xlabel('C (1/alpha)')
plt.ylabel('Precision')
plt.title('Precision vs C')
# Recall plot
plt.subplot(2, 3, 3)
plt.plot(C_values, recall_values, label='Recall', color='orange')
plt.xscale('log')
plt.xlabel('C (1/alpha)')
plt.ylabel('Recall')
plt.title('Recall vs C')
# F1 Score plot
plt.subplot(2, 3, 4)
plt.plot(C_values, f1_values, label='F1 Score', color='purple')
plt.xscale('log')
plt.xlabel('C (1/alpha)')
plt.ylabel('F1 Score')
plt.title('F1 Score vs C')
plt.tight_layout()
plt.show()
Best C for Logistic Regression with Lasso: 0.0869 with Accuracy: 0.6171
Optimal Model Fit¶
Classification Metrics Using K-fold CV¶
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd
# Define the model
final_logreg_lasso = LogisticRegression(penalty='l1', solver='liblinear', C=best_C, random_state=1)
# Set up Stratified K-Fold cross-validation (e.g., 5 folds)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# Get cross-validated predictions on the train set
y_pred_cv_train = cross_val_predict(final_logreg_lasso, x_train, y_train, cv=skf)
# Calculate metrics for cross-validated train set predictions
cv_accuracy_train = accuracy_score(y_train, y_pred_cv_train)
cv_precision_train = precision_score(y_train, y_pred_cv_train, zero_division=0)
cv_recall_train = recall_score(y_train, y_pred_cv_train)
cv_f1_train = f1_score(y_train, y_pred_cv_train)
# Fit the model on the entire train set and evaluate on the test set
final_logreg_lasso.fit(x_train, y_train)
y_pred_test = final_logreg_lasso.predict(x_test)
# Calculate metrics on the test set
accuracy_test = accuracy_score(y_test, y_pred_test)
precision_test = precision_score(y_test, y_pred_test, zero_division=0)
recall_test = recall_score(y_test, y_pred_test)
f1_test = f1_score(y_test, y_pred_test)
# Creating a DataFrame to compare metrics between cross-validated train set and test set
lasso_metrics_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Lasso Train': [cv_accuracy_train, cv_precision_train, cv_recall_train, cv_f1_train],
    'Lasso Test': [accuracy_test, precision_test, recall_test, f1_test]
})
# Display the DataFrame with the performance metrics
print("Final Performance Metrics of Lasso after Tuning:")
lasso_metrics_df
Final Performance Metrics of Lasso after Tuning:
| Metric | Lasso Train | Lasso Test | |
|---|---|---|---|
| 0 | Accuracy | 0.617063 | 0.606742 |
| 1 | Precision | 0.614795 | 0.609846 |
| 2 | Recall | 0.942065 | 0.928412 |
| 3 | F1 Score | 0.744032 | 0.736142 |
Confusion Matrix¶
# Print the class assignment for SEGMENT_1
print(f"Class mapping for 'SEGMENT_1': {le.classes_[0]} -> 0, {le.classes_[1]} -> 1")
# Confusion matrix for the test set
confusion_matrix_with_counts_and_percentage(final_logreg_lasso, x_test, y_test)
Class mapping for 'SEGMENT_1': Core -> 0, Up -> 1
array([[ 88, 531],
[ 64, 830]], dtype=int64)
Feature Importance¶
# Retrieve the coefficients from the final Lasso model
lasso_coefficients = final_logreg_lasso.coef_[0]
# Create a DataFrame to display the coefficients alongside their corresponding feature names
lasso_coefficients_df = pd.DataFrame({
    'Feature': x_train.columns,  # Feature names
    'Coefficient': lasso_coefficients  # Lasso coefficients
})
# Sort the DataFrame based on Lasso coefficients for better clarity
lasso_coefficients_df = lasso_coefficients_df.sort_values(by='Coefficient', ascending=False).reset_index(drop=True)
# Plot the feature importance (coefficients)
plt.figure(figsize=(12, 8)) # Adjusting the size to accommodate many features
bars = plt.barh(lasso_coefficients_df['Feature'], lasso_coefficients_df['Coefficient'], color='skyblue')
# Make the plot scrollable by rotating labels and adjusting limits
plt.gca().invert_yaxis() # Invert y-axis for better readability
plt.xlabel('Coefficient')
plt.ylabel('Features')
plt.title('Feature Importance (Lasso Coefficients)')
plt.xticks(rotation=45)
# Display a horizontal scrollable plot by adjusting figsize width and label size
plt.show()
# Display the coefficients in a table
print("Feature Importance (Lasso Coefficients):")
lasso_coefficients_df
Feature Importance (Lasso Coefficients):
| Feature | Coefficient | |
|---|---|---|
| 0 | LOYALTY_GROUP_Vip | 1.798204 |
| 1 | LOYALTY_GROUP_Loyal | 1.507323 |
| 2 | MOSTUSED_PLATFORM_Web | 0.340726 |
| 3 | AVERAGE_TICKET | 0.185441 |
| 4 | PRICE_GROUP_Moderately Price Insensitive | 0.089952 |
| 5 | GENDER_Female | 0.082822 |
| 6 | PRICE_GROUP_Selective Price Sensitive | 0.060733 |
| 7 | AGE | 0.011581 |
| 8 | PRICE_GROUP_Very Price Sensitive | 0.000000 |
| 9 | MOSTUSED_PLATFORM_Mobile | 0.000000 |
| 10 | MARITAL_STATUS_Single | 0.000000 |
| 11 | MARITAL_STATUS_Divorced | 0.000000 |
| 12 | GENDER_Male | 0.000000 |
| 13 | PRICE_GROUP_Moderately Price Sensitive | 0.000000 |
| 14 | FREQUENCY | 0.000000 |
| 15 | CONSISTENCY | 0.000000 |
| 16 | LOYALTY_GROUP_Ocasional | -0.005145 |
| 17 | RECENCY | -0.008323 |
| 18 | MARITAL_STATUS_Married | -0.027812 |
| 19 | PRICE_GROUP_Very Price Insensitive | -0.036598 |
| 20 | CUMSALES | -0.393244 |
| 21 | LOYALTY_GROUP_Split | -0.556054 |
3.2 Non-Parametric Models¶
3.2.1 Random Forest¶
Baseline Model Fit¶
First, let's create a base model with 'n_estimators' set to 10, without specifying any other parameters.
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
# Create a classifier object (instead of regressor)
rf_classifier = RandomForestClassifier(n_estimators=10, random_state=1) # Specify 10 trees
# Fit the classifier with the training data
rf_classifier.fit(x_train, y_train)
RandomForestClassifier(n_estimators=10, random_state=1)
# We have created the model "rf_classifier", and it has been trained
# Let's see the specifications of the model created:
params_rf_classifier = rf_classifier.get_params()
print(params_rf_classifier)
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 10, 'n_jobs': None, 'oob_score': False, 'random_state': 1, 'verbose': 0, 'warm_start': False}
Note that if we print params_rf_classifier (the variable containing the model's 'specifications'), we can see that 'gini' is the split criterion used, and the bootstrap option is set to True. Additionally, max_depth is None, so each tree grows until its leaves are pure. Let's examine the results:
# Import necessary library for tree plotting
from sklearn.tree import plot_tree
# Choose one of the trees from the "Forest"
tree = rf_classifier.estimators_[1] # Picking the 2nd tree in the forest
# Plot the tree
plt.figure(figsize=(20,10))
plot_tree(tree, filled=True, feature_names=x_train.columns, rounded=True)
plt.savefig('decision_tree.jpg', format='jpg', dpi=300, bbox_inches='tight')
plt.show()
Examining a single tree in the 'forest' reveals a high level of complexity.
Baseline Model Results¶
# Make predictions on the training data
y_train_rf_base_pred = rf_classifier.predict(x_train)
# Calculate accuracy, precision, recall, F1-score for the training data
accuracy_train_rf_base = accuracy_score(y_train, y_train_rf_base_pred)
precision_train_rf_base = precision_score(y_train, y_train_rf_base_pred, average='binary', zero_division=0)
recall_train_rf_base = recall_score(y_train, y_train_rf_base_pred, average='binary')
f1_train_rf_base = f1_score(y_train, y_train_rf_base_pred, average='binary')
# Make predictions using the model on the test data
y_test_rf_base_pred = rf_classifier.predict(x_test)
# Calculate accuracy, precision, recall, F1-score for the test data
accuracy_test_rf_base = accuracy_score(y_test, y_test_rf_base_pred)
precision_test_rf_base = precision_score(y_test, y_test_rf_base_pred, average='binary', zero_division=0)
recall_test_rf_base = recall_score(y_test, y_test_rf_base_pred, average='binary')
f1_test_rf_base = f1_score(y_test, y_test_rf_base_pred, average='binary')
# Create a summary table
metrics = {
'Accuracy': {'Train': accuracy_train_rf_base, 'Test': accuracy_test_rf_base},
'Precision': {'Train': precision_train_rf_base, 'Test': precision_test_rf_base},
'Recall': {'Train': recall_train_rf_base, 'Test': recall_test_rf_base},
'F1 Score': {'Train': f1_train_rf_base, 'Test': f1_test_rf_base}
}
# Create the DataFrame from the dictionary
metrics_rf_base_df = pd.DataFrame(metrics)
print('Results for Basic RF model')
metrics_rf_base_df.T
Results for Basic RF model
| | Train | Test |
|---|---|---|
| Accuracy | 0.983300 | 0.575677 |
| Precision | 0.992063 | 0.645161 |
| Recall | 0.979569 | 0.626398 |
| F1 Score | 0.985777 | 0.635641 |
Hyperparameter Tuning Using K-fold CV¶
The significant drop from train accuracy to test accuracy shows that the model was overfit. It is time to prune it.
Let's develop a function to determine the optimal number of trees and maximum tree depth, with the goal of maximizing the cross-validated accuracy. The search ranges for max_depth and n_estimators are set in advance and can be widened if the best values land on those limits, making this a kind of dynamic tuning approach.
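The "widen if the best value lands on a limit" step can be sketched as follows; `hit_boundary` and `widen` are illustrative helpers, not part of the tuning function below.

```python
def hit_boundary(best_value, search_range):
    """True if the best value found sits on either edge of the searched range."""
    values = list(search_range)
    return best_value in (values[0], values[-1])

def widen(search_range, factor=2):
    """Extend a range's upper end by `factor`, keeping its start and step."""
    values = list(search_range)
    step = values[1] - values[0] if len(values) > 1 else 1
    return range(values[0], values[0] + step * len(values) * factor, step)

# A best max_depth of 11 is interior to range(2, 20): no need to widen.
print(hit_boundary(11, range(2, 20)))      # False
# Had the search returned 29 for n_estimators, it would sit on the edge:
print(hit_boundary(29, range(1, 30, 2)))   # True
print(list(widen(range(1, 30, 2)))[-1])    # the widened search extends to 59
```

After each search, one would re-run the grid with the widened range until the optimum lands strictly inside it.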
from sklearn.model_selection import cross_val_score
def find_optimal_rf_params_and_plot_cv(x_train, y_train, n_estimators_range, max_depth_range, cv=5):
# Initialize variables to store the optimal parameters and results for plotting
max_accuracy = 0
best_n_estimators = None
best_max_depth = None
plot_data = []
# Iterate over all combinations of n_estimators and max_depth
for n_estimators in n_estimators_range:
for max_depth in max_depth_range:
# Create RandomForestClassifier model
rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=1)
# Perform cross-validation
cv_scores = cross_val_score(rf, x_train, y_train, cv=cv, scoring='accuracy')
# Calculate mean CV accuracy
accuracy = cv_scores.mean()
plot_data.append((n_estimators, max_depth, accuracy))
# Update the optimal parameters if current accuracy is higher than max_accuracy
if accuracy > max_accuracy:
max_accuracy = accuracy
best_n_estimators = n_estimators
best_max_depth = max_depth
# Plotting the results
plot_df = pd.DataFrame(plot_data, columns=['n_estimators', 'max_depth', 'Accuracy'])
fig, ax = plt.subplots(figsize=(12, 6))
for n_estimator in n_estimators_range:
subset = plot_df[plot_df['n_estimators'] == n_estimator]
ax.plot(subset['max_depth'], subset['Accuracy'], label=f'n_estimators={n_estimator}')
ax.set_xlabel('Max Depth')
ax.set_ylabel('Cross-Validated Accuracy')
ax.set_title('Evolution of Accuracy with Different n_estimators and max_depth (CV)')
ax.legend()
plt.show()
return best_n_estimators, best_max_depth, max_accuracy
# Example usage with CV
n_estimators_range = range(1, 30, 2) # Define ranges for n_estimators and max_depth
max_depth_range = range(2, 20)
optimal_n_estimators, optimal_max_depth, optimal_accuracy = find_optimal_rf_params_and_plot_cv(
x_train, y_train, n_estimators_range, max_depth_range, cv=5
)
print(f"Optimal n_estimators: {optimal_n_estimators}, Optimal max_depth: {optimal_max_depth}, Optimal Accuracy: {optimal_accuracy:.4f}")
Optimal n_estimators: 27, Optimal max_depth: 11, Optimal Accuracy: 0.6316
Optimal Model Fit¶
The grid search identifies 27 trees with a maximum depth of 11 per tree as the combination with the highest CV accuracy, which implies fairly complex trees. A closer look at the graph, however, shows that a model with n_estimators=9 and max_depth=8 (by slight eyeballing) achieves a CV accuracy nearly identical to that of the best model while being much simpler. So we prefer the 9-tree, depth-8 model.
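The eyeballing step can also be automated: among all models within a small tolerance of the best CV accuracy, pick the simplest. A sketch assuming a DataFrame shaped like the `plot_df` built inside the tuning function (the demo accuracies below are illustrative stand-ins, not the actual CV results):

```python
import pandas as pd

def simplest_near_best(cv_results, tol=0.005):
    """Among models within `tol` of the best CV accuracy, return the one with
    the fewest trees (ties broken by the shallowest depth)."""
    best = cv_results['Accuracy'].max()
    near_best = cv_results[cv_results['Accuracy'] >= best - tol]
    return near_best.sort_values(['n_estimators', 'max_depth']).iloc[0]

# Illustrative numbers mirroring the search above: the 9-tree model is within
# tolerance of the 27-tree optimum, while the 5-tree model falls short.
demo = pd.DataFrame({
    'n_estimators': [27, 9, 5],
    'max_depth':    [11, 8, 4],
    'Accuracy':     [0.6316, 0.6290, 0.6100],
})
print(simplest_near_best(demo))
```

With the real `plot_df` in place of `demo`, this makes the preference for the simpler model reproducible instead of visual.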
Classification Metrics Using K-fold CV¶
# Create the optimal RandomForestClassifier
optim_rf_classifier = RandomForestClassifier(n_estimators= 9, max_depth= 8, random_state=1)
# Perform cross-validation to evaluate performance on the training data (using accuracy, precision, recall, F1 score)
cv_accuracy_scores = cross_val_score(optim_rf_classifier, x_train, y_train, cv=5, scoring='accuracy')
cv_precision_scores = cross_val_score(optim_rf_classifier, x_train, y_train, cv=5, scoring='precision')
cv_recall_scores = cross_val_score(optim_rf_classifier, x_train, y_train, cv=5, scoring='recall')
cv_f1_scores = cross_val_score(optim_rf_classifier, x_train, y_train, cv=5, scoring='f1')
# Calculate mean CV scores
cv_accuracy = cv_accuracy_scores.mean()
cv_precision = cv_precision_scores.mean()
cv_recall = cv_recall_scores.mean()
cv_f1 = cv_f1_scores.mean()
# Now, fit the model on the full training data
optim_rf_classifier.fit(x_train, y_train)
# Predict on the test data
y_test_rf_optim_pred = optim_rf_classifier.predict(x_test)
# Calculate test set metrics
accuracy_test_rf_optim = accuracy_score(y_test, y_test_rf_optim_pred)
precision_test_rf_optim = precision_score(y_test, y_test_rf_optim_pred, zero_division=0)
recall_test_rf_optim = recall_score(y_test, y_test_rf_optim_pred)
f1_test_rf_optim = f1_score(y_test, y_test_rf_optim_pred)
# Creating a DataFrame to compare metrics between cross-validated train set and test set
rf_metrics_df = pd.DataFrame({
'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
'RF Train': [cv_accuracy, cv_precision, cv_recall, cv_f1],
'RF Test': [accuracy_test_rf_optim, precision_test_rf_optim, recall_test_rf_optim, f1_test_rf_optim]
})
# Display the DataFrame with the performance metrics
print("Performance Metrics of Random Forest:")
rf_metrics_df
Performance Metrics of Random Forest:
| | Metric | RF Train | RF Test |
|---|---|---|---|
| 0 | Accuracy | 0.626158 | 0.638467 |
| 1 | Precision | 0.624058 | 0.629963 |
| 2 | Recall | 0.923881 | 0.940716 |
| 3 | F1 Score | 0.744872 | 0.754598 |
Again, let's plot a single tree from the tuned forest:
# Choose one tree from the forest
tree = optim_rf_classifier.estimators_[0]
# Plot the tree
plt.figure(figsize=(20,10))
plot_tree(tree, filled=True, feature_names=x_train.columns, rounded=True)
plt.show()
Confusion Matrix¶
# Print the class assignment for SEGMENT_1
print(f"Class mapping for 'SEGMENT_1': {le.classes_[0]} -> 0, {le.classes_[1]} -> 1")
# We've already defined the function `confusion_matrix_with_counts_and_percentage` earlier
# Call the function to display the confusion matrix for the Random Forest model
confusion_matrix_with_counts_and_percentage(optim_rf_classifier, x_test, y_test)
Class mapping for 'SEGMENT_1': Core -> 0, Up -> 1
array([[125, 494],
[ 53, 841]], dtype=int64)
Feature Importance¶
Let's see the feature importance of each independent variable in our improved model, optim_rf_classifier.
# Get feature importances
feature_importances = optim_rf_classifier.feature_importances_
# Create a DataFrame for feature importances
RF_coefficients_df = pd.DataFrame({
'Feature': x_train.columns, # Feature names
'Importance': feature_importances # Feature importance from Random Forest
})
# Sort the DataFrame by importance for better clarity
RF_coefficients_df = RF_coefficients_df.sort_values(by='Importance', ascending=False).reset_index(drop=True)
# Plot the feature importances (horizontal bar chart)
plt.figure(figsize=(12, 8)) # Adjust the size to accommodate many features
bars = plt.barh(RF_coefficients_df['Feature'], RF_coefficients_df['Importance'], color='skyblue')
# Add importance values on top of bars
for bar in bars:
plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, f'{bar.get_width():.4f}', va='center')
# Invert y-axis for better readability
plt.gca().invert_yaxis()
plt.xlabel('Importance')
plt.ylabel('Features')
plt.title('Feature Importances in Optimal Random Forest Model')
# Display a horizontal scrollable plot by adjusting figsize width and label size
plt.show()
# Display the feature importances in a table
print("Feature Importances (Random Forest):")
RF_coefficients_df
Feature Importances (Random Forest):
| | Feature | Importance |
|---|---|---|
| 0 | CUMSALES | 0.234167 |
| 1 | AVERAGE_TICKET | 0.117373 |
| 2 | CONSISTENCY | 0.098007 |
| 3 | LOYALTY_GROUP_Split | 0.093090 |
| 4 | AGE | 0.083430 |
| 5 | LOYALTY_GROUP_Loyal | 0.073319 |
| 6 | RECENCY | 0.068930 |
| 7 | FREQUENCY | 0.052817 |
| 8 | MOSTUSED_PLATFORM_Mobile | 0.034877 |
| 9 | LOYALTY_GROUP_Ocasional | 0.032877 |
| 10 | MOSTUSED_PLATFORM_Web | 0.013294 |
| 11 | MARITAL_STATUS_Married | 0.012498 |
| 12 | GENDER_Male | 0.011652 |
| 13 | GENDER_Female | 0.011103 |
| 14 | LOYALTY_GROUP_Vip | 0.010499 |
| 15 | MARITAL_STATUS_Single | 0.009536 |
| 16 | PRICE_GROUP_Moderately Price Insensitive | 0.008620 |
| 17 | PRICE_GROUP_Very Price Insensitive | 0.008362 |
| 18 | PRICE_GROUP_Very Price Sensitive | 0.008360 |
| 19 | PRICE_GROUP_Selective Price Sensitive | 0.007476 |
| 20 | PRICE_GROUP_Moderately Price Sensitive | 0.006396 |
| 21 | MARITAL_STATUS_Divorced | 0.003318 |
3.2.2 Gradient Boost¶
Baseline Model Fit¶
To identify the best model, let's begin with a base Gradient Boosting model, setting n_estimators to 10, similar to our initial approach with the Random Forest base model.
# Import necessary libraries
from sklearn.ensemble import GradientBoostingClassifier
# Create a Gradient Boosting classifier object
gb_classifier = GradientBoostingClassifier(n_estimators=10, random_state=1) # Specify 10 boosting stages (trees)
# Fit the classifier with the training data
gb_classifier.fit(x_train, y_train)
GradientBoostingClassifier(n_estimators=10, random_state=1)
# Get the parameters of the Gradient Boosting classifier model
params_gb_classifier = gb_classifier.get_params()
# Print the parameters
print(params_gb_classifier)
{'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'log_loss', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 10, 'n_iter_no_change': None, 'random_state': 1, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
Note that if we print params_gb_classifier (the variable containing the model's 'specifications'), we can see that the error criterion used is friedman_mse, and max_depth is set to 3 as a default value. Let's examine the results:
from sklearn.tree import plot_tree
# Choose which tree to visualize (0 for the first tree)
tree_index = 0
# Extract the tree from the classifier (use estimators_)
single_tree = gb_classifier.estimators_[tree_index, 0]
# Plot the specified decision tree
plt.figure(figsize=(20, 10))
plot_tree(single_tree, filled=True, feature_names=x_train.columns, rounded=True)
plt.show()
Baseline Model Results¶
The tree is both clear and easily interpretable, unlike in Random Forest. Now, let's review the classification metrics (accuracy, precision, recall, and F1 score) for both the training and testing evaluations.
# Make predictions on the training data
y_train_gb_base_pred = gb_classifier.predict(x_train)
# Calculate accuracy, precision, recall, F1-score for the training data
accuracy_train_gb_base = accuracy_score(y_train, y_train_gb_base_pred)
precision_train_gb_base = precision_score(y_train, y_train_gb_base_pred, average='binary', zero_division=0)
recall_train_gb_base = recall_score(y_train, y_train_gb_base_pred, average='binary')
f1_train_gb_base = f1_score(y_train, y_train_gb_base_pred, average='binary')
# Make predictions using the model on the test data
y_test_gb_base_pred = gb_classifier.predict(x_test)
# Calculate accuracy, precision, recall, F1-score for the test data
accuracy_test_gb_base = accuracy_score(y_test, y_test_gb_base_pred)
precision_test_gb_base = precision_score(y_test, y_test_gb_base_pred, average='binary', zero_division=0)
recall_test_gb_base = recall_score(y_test, y_test_gb_base_pred, average='binary')
f1_test_gb_base = f1_score(y_test, y_test_gb_base_pred, average='binary')
# Create a summary table
metrics_gb_base = {
'Accuracy': {'Train': accuracy_train_gb_base, 'Test': accuracy_test_gb_base},
'Precision': {'Train': precision_train_gb_base, 'Test': precision_test_gb_base},
'Recall': {'Train': recall_train_gb_base, 'Test': recall_test_gb_base},
'F1 Score': {'Train': f1_train_gb_base, 'Test': f1_test_gb_base}
}
# Create the DataFrame from the dictionary
metrics_gb_base_df = pd.DataFrame(metrics_gb_base)
print('Results for Basic GB model')
metrics_gb_base_df.T
Results for Basic GB model
| | Train | Test |
|---|---|---|
| Accuracy | 0.626653 | 0.627231 |
| Precision | 0.612953 | 0.613324 |
| Recall | 0.998601 | 0.998881 |
| F1 Score | 0.759634 | 0.760000 |
Hyperparameter Tuning Using K-fold CV¶
All the train and test metrics are very close. It seems we have hit the right parameters for the trees. Let's check.
def find_optimal_gb_params_and_plot_cv(x_train, y_train, n_estimators_range, max_depth_range, cv=5):
# Initialize variables to store the optimal parameters and results for plotting
max_accuracy = 0
best_n_estimators = None
best_max_depth = None
plot_data = []
# Iterate over all combinations of n_estimators and max_depth
for n_estimators in n_estimators_range:
for max_depth in max_depth_range:
# Create GradientBoostingClassifier model
gb = GradientBoostingClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=1)
# Perform cross-validation
cv_scores = cross_val_score(gb, x_train, y_train, cv=cv, scoring='accuracy')
# Calculate mean CV accuracy
accuracy = cv_scores.mean()
plot_data.append((n_estimators, max_depth, accuracy))
# Update the optimal parameters if current accuracy is higher than max_accuracy
if accuracy > max_accuracy:
max_accuracy = accuracy
best_n_estimators = n_estimators
best_max_depth = max_depth
# Plotting the results
plot_df = pd.DataFrame(plot_data, columns=['n_estimators', 'max_depth', 'Accuracy'])
fig, ax = plt.subplots(figsize=(12, 6))
for n_estimator in n_estimators_range:
subset = plot_df[plot_df['n_estimators'] == n_estimator]
ax.plot(subset['max_depth'], subset['Accuracy'], label=f'n_estimators={n_estimator}')
ax.set_xlabel('Max Depth')
ax.set_ylabel('Cross-Validated Accuracy')
ax.set_title('Evolution of Accuracy with Different n_estimators and max_depth (CV)')
ax.legend()
plt.show()
return best_n_estimators, best_max_depth, max_accuracy
# Example usage with CV
n_estimators_range = range(1, 30, 2) # Define ranges for n_estimators and max_depth
max_depth_range = range(2, 10)
optimal_n_estimators, optimal_max_depth, optimal_accuracy = find_optimal_gb_params_and_plot_cv(
x_train, y_train, n_estimators_range, max_depth_range, cv=5
)
print(f"Optimal n_estimators: {optimal_n_estimators}, Optimal max_depth: {optimal_max_depth}, Optimal Accuracy: {optimal_accuracy:.4f}")
Optimal n_estimators: 15, Optimal max_depth: 5, Optimal Accuracy: 0.6379
Optimal Model Fit¶
A model with 15 trees and a maximum depth of 5 per tree is optimal for achieving the highest cross-validated accuracy.
# Create the optimal GradientBoostingClassifier
optim_gb_classifier = GradientBoostingClassifier(n_estimators=optimal_n_estimators, max_depth=optimal_max_depth, random_state=1)
Classification Metrics Using K-fold CV¶
Let's examine the measures of fit for this optimal model, both for training and testing.
# Perform cross-validation to evaluate performance on the training data (using accuracy, precision, recall, F1 score)
cv_accuracy_scores = cross_val_score(optim_gb_classifier, x_train, y_train, cv=5, scoring='accuracy')
cv_precision_scores = cross_val_score(optim_gb_classifier, x_train, y_train, cv=5, scoring='precision')
cv_recall_scores = cross_val_score(optim_gb_classifier, x_train, y_train, cv=5, scoring='recall')
cv_f1_scores = cross_val_score(optim_gb_classifier, x_train, y_train, cv=5, scoring='f1')
# Calculate mean CV scores
cv_accuracy = cv_accuracy_scores.mean()
cv_precision = cv_precision_scores.mean()
cv_recall = cv_recall_scores.mean()
cv_f1 = cv_f1_scores.mean()
# Now, fit the model on the full training data
optim_gb_classifier.fit(x_train, y_train)
# Predict on the test data
y_test_gb_optim_pred = optim_gb_classifier.predict(x_test)
# Calculate test set metrics
accuracy_test_gb_optim = accuracy_score(y_test, y_test_gb_optim_pred)
precision_test_gb_optim = precision_score(y_test, y_test_gb_optim_pred, zero_division=0)
recall_test_gb_optim = recall_score(y_test, y_test_gb_optim_pred)
f1_test_gb_optim = f1_score(y_test, y_test_gb_optim_pred)
# Creating a DataFrame to compare metrics between cross-validated train set and test set
gb_metrics_df = pd.DataFrame({
'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
'GB Train': [cv_accuracy, cv_precision, cv_recall, cv_f1],
'GB Test': [accuracy_test_gb_optim, precision_test_gb_optim, recall_test_gb_optim, f1_test_gb_optim]
})
# Display the DataFrame with the performance metrics
print("Performance Metrics of Gradient Boosting:")
gb_metrics_df
Performance Metrics of Gradient Boosting:
| | Metric | GB Train | GB Test |
|---|---|---|---|
| 0 | Accuracy | 0.637897 | 0.634501 |
| 1 | Precision | 0.623253 | 0.620665 |
| 2 | Recall | 0.978733 | 0.980984 |
| 3 | F1 Score | 0.761541 | 0.760295 |
Let's see a single tree from the tuned model.
from sklearn.tree import plot_tree
# Choose which tree to visualize (0 for the first tree)
tree_index = 0
# Extract the tree from the classifier
single_tree = optim_gb_classifier.estimators_[tree_index, 0] # Classifier version
# Plot the specified decision tree
plt.figure(figsize=(20, 10))
plot_tree(single_tree, filled=True, feature_names=x_train.columns, rounded=True)
# Save the plot to a file
plt.savefig('decision_tree.jpg', format='jpg', dpi=300, bbox_inches='tight')
# Show the plot
plt.show()
Confusion Matrix¶
# Print the class assignment for SEGMENT_1
print(f"Class mapping for 'SEGMENT_1': {le.classes_[0]} -> 0, {le.classes_[1]} -> 1")
# Call the function to display the confusion matrix for the Gradient Boosting model
confusion_matrix_with_counts_and_percentage(optim_gb_classifier, x_test, y_test)
Class mapping for 'SEGMENT_1': Core -> 0, Up -> 1
array([[ 83, 536],
[ 17, 877]], dtype=int64)
Feature Importance¶
Let's see the feature importance according to the gradient boosting model.
# Get feature importances
feature_importances_gb = optim_gb_classifier.feature_importances_
# Create a DataFrame for feature importances
GB_coefficients_df = pd.DataFrame({
'Feature': x_train.columns, # Feature names
'Importance': feature_importances_gb # Feature importance from Gradient Boosting
})
# Sort the DataFrame by importance for better clarity
GB_coefficients_df = GB_coefficients_df.sort_values(by='Importance', ascending=False).reset_index(drop=True)
# Plot the feature importances (horizontal bar chart)
plt.figure(figsize=(12, 8)) # Adjust the size to accommodate many features
bars = plt.barh(GB_coefficients_df['Feature'], GB_coefficients_df['Importance'], color='skyblue')
# Add importance values on top of bars
for bar in bars:
plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, f'{bar.get_width():.4f}', va='center')
# Invert y-axis for better readability
plt.gca().invert_yaxis()
plt.xlabel('Importance')
plt.ylabel('Features')
plt.title('Feature Importances in Optimal Gradient Boosting Model')
# Display a horizontal scrollable plot by adjusting figsize width and label size
plt.show()
# Display the feature importances in a table
print("Feature Importances (Gradient Boosting):")
GB_coefficients_df
Feature Importances (Gradient Boosting):
| | Feature | Importance |
|---|---|---|
| 0 | CUMSALES | 0.317893 |
| 1 | LOYALTY_GROUP_Split | 0.150193 |
| 2 | LOYALTY_GROUP_Ocasional | 0.111037 |
| 3 | CONSISTENCY | 0.106472 |
| 4 | FREQUENCY | 0.091316 |
| 5 | AGE | 0.055004 |
| 6 | AVERAGE_TICKET | 0.043205 |
| 7 | MOSTUSED_PLATFORM_Web | 0.032430 |
| 8 | LOYALTY_GROUP_Loyal | 0.027799 |
| 9 | LOYALTY_GROUP_Vip | 0.022732 |
| 10 | MOSTUSED_PLATFORM_Mobile | 0.020169 |
| 11 | RECENCY | 0.012317 |
| 12 | GENDER_Male | 0.002187 |
| 13 | GENDER_Female | 0.002020 |
| 14 | PRICE_GROUP_Very Price Insensitive | 0.001822 |
| 15 | PRICE_GROUP_Selective Price Sensitive | 0.001545 |
| 16 | PRICE_GROUP_Moderately Price Insensitive | 0.000965 |
| 17 | PRICE_GROUP_Moderately Price Sensitive | 0.000679 |
| 18 | MARITAL_STATUS_Married | 0.000215 |
| 19 | PRICE_GROUP_Very Price Sensitive | 0.000000 |
| 20 | MARITAL_STATUS_Divorced | 0.000000 |
| 21 | MARITAL_STATUS_Single | 0.000000 |
As in the previous models, the gradient boosting model assigns near-zero importance to several features, some of them the same ones flagged as weak before, indicating they carry little predictive power.
3.2.3 Support Vector Machine¶
Bivariate Plots¶
Using the pair plot from EDA on the dataset before scaling, with its raw variables, can give an idea through eyeballing of which specific kernel could be better suited.
# Pair plot with 'SEGMENT_1' as the hue (color-coded by target class)
sns.pairplot(scaled_data, hue='SEGMENT_1', diag_kind='kde', height=2)
plt.show()
The above pair plot barely indicates any clear linear separation between the target classes for any pair of features. With the points scattered everywhere, an RBF or polynomial kernel seems better suited. That said, even though pair plots can only show the data in 2D, linear separation might still be possible in higher dimensions. So, let's compare all the kernels using cross-validation.
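That caveat works in both directions: data that looks hopeless for a linear boundary in 2D can be trivial for a kernel. A synthetic XOR-style illustration (not our dataset):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
# XOR-style labels: the class depends on the sign pattern of the two features
y = (X[:, 0] * X[:, 1] > 0).astype(int)

linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf').fit(X, y).score(X, y)
print(f"linear kernel training accuracy: {linear_acc:.2f}")
print(f"rbf kernel training accuracy:    {rbf_acc:.2f}")
```

On this data the linear kernel stays close to chance level while the RBF kernel captures the quadrant pattern, even though no 2D scatter of raw features would suggest a clean boundary exists.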
Decision on Kernel: Polynomial Wins!¶
# Function to train and evaluate an SVM model with flexibility for C, gamma, and degree
def train_evaluate_svm(kernel_type, x_train, y_train, x_test, y_test, C=1, degree=3, gamma='scale'):
svm_model = SVC(kernel=kernel_type, C=C, degree=degree, gamma=gamma, random_state=1)
svm_model.fit(x_train, y_train)
# Predictions
y_train_pred = svm_model.predict(x_train)
y_test_pred = svm_model.predict(x_test)
# Metrics for Training Set
metrics_train = {
'Accuracy': accuracy_score(y_train, y_train_pred),
'Precision': precision_score(y_train, y_train_pred),
'Recall': recall_score(y_train, y_train_pred),
'F1 Score': f1_score(y_train, y_train_pred)
}
# Metrics for Test Set
metrics_test = {
'Accuracy': accuracy_score(y_test, y_test_pred),
'Precision': precision_score(y_test, y_test_pred),
'Recall': recall_score(y_test, y_test_pred),
'F1 Score': f1_score(y_test, y_test_pred)
}
return metrics_train, metrics_test
# Train and evaluate SVM kernels with different C and gamma values
# For linear kernel (no gamma or degree needed)
metrics_linear_train, metrics_linear_test = train_evaluate_svm('linear', x_train, y_train, x_test, y_test, C=1)
# For polynomial kernel (adjust C, gamma, degree)
metrics_poly_train, metrics_poly_test = train_evaluate_svm('poly', x_train, y_train, x_test, y_test, C=1, degree=3, gamma='scale')
# For RBF kernel (adjust C and gamma)
metrics_rbf_train, metrics_rbf_test = train_evaluate_svm('rbf', x_train, y_train, x_test, y_test, C=1, gamma='scale')
# Combine results into a DataFrame for easy comparison
metrics_data = {
'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
'Linear (Train)': list(metrics_linear_train.values()),
'Linear (Test)': list(metrics_linear_test.values()),
'Polynomial (Train)': list(metrics_poly_train.values()),
'Polynomial (Test)': list(metrics_poly_test.values()),
'RBF (Train)': list(metrics_rbf_train.values()),
'RBF (Test)': list(metrics_rbf_test.values())
}
# Convert to DataFrame and display
combined_metrics_df = pd.DataFrame(metrics_data)
print('===========================================================================================================')
print('COMPARISON OF METRICS FOR LINEAR, POLYNOMIAL, AND RBF KERNELS')
print('===========================================================================================================')
combined_metrics_df
=========================================================================================================== COMPARISON OF METRICS FOR LINEAR, POLYNOMIAL, AND RBF KERNELS ===========================================================================================================
| Metric | Linear (Train) | Linear (Test) | Polynomial (Train) | Polynomial (Test) | RBF (Train) | RBF (Test) | |
|---|---|---|---|---|---|---|---|
| 0 | Accuracy | 0.614583 | 0.612690 | 0.656911 | 0.631857 | 0.645833 | 0.620621 |
| 1 | Precision | 0.609177 | 0.608146 | 0.638755 | 0.625840 | 0.630780 | 0.617130 |
| 2 | Recall | 0.969773 | 0.968680 | 0.965015 | 0.937360 | 0.965855 | 0.942953 |
| 3 | F1 Score | 0.748299 | 0.747196 | 0.768699 | 0.750560 | 0.763158 | 0.746018 |
The polynomial kernel posts the best test accuracy, precision, and F1 score, and its train-test gap (about 0.025, similar to RBF's) remains modest. The linear kernel has the smallest gap but trails the others on nearly every absolute metric. Since our goal is a model that performs consistently well on unseen data, the polynomial kernel is the safer choice.
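These trade-offs are easier to judge side by side. A quick tabulation of test accuracy against the train-test gap, with accuracies copied from the comparison table above:

```python
# Train/test accuracies copied from the kernel comparison table
results = {
    'Linear':     (0.614583, 0.612690),
    'Polynomial': (0.656911, 0.631857),
    'RBF':        (0.645833, 0.620621),
}
print(f"{'Kernel':<11} {'Test acc':>9} {'Train-test gap':>15}")
for kernel, (train_acc, test_acc) in results.items():
    print(f"{kernel:<11} {test_acc:>9.4f} {train_acc - test_acc:>15.4f}")
```

The polynomial kernel tops the test accuracy column, while the linear kernel wins only on the gap, which motivates tuning the polynomial kernel next.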
Hyperparameter Tuning Using K-fold CV¶
We'll perform a grid search to find the best C (regularization) value by cross-validating across multiple values of C.
# Define the parameter grid for C values
param_grid = {'C': [1, 10, 100, 1000]}
# Set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(SVC(kernel='poly', random_state=1), param_grid, cv=5, scoring='accuracy')
# Perform the grid search on the training data
grid_search.fit(x_train, y_train)
GridSearchCV(cv=5, estimator=SVC(kernel='poly', random_state=1),
             param_grid={'C': [1, 10, 100, 1000]}, scoring='accuracy')
# Get the best C value found via cross-validation
best_C = grid_search.best_params_['C']
print(f'Best C value: {best_C}')
# Visualize C values vs accuracy changes
# Extract cross-validation results
results = pd.DataFrame(grid_search.cv_results_)
# Plot validation accuracy vs. C values
plt.figure(figsize=(10, 6))
plt.plot(param_grid['C'], results['mean_test_score'], marker='o', label='Mean Test Score (Validation Accuracy)')
plt.xscale('log')
plt.xlabel('C Value (Log Scale)')
plt.ylabel('Mean Cross-Validation Accuracy')
plt.title('Cross-Validation Accuracy for Different C Values (SVM Polynomial Kernel)')
plt.grid(True)
plt.legend()
plt.show()
Best C value: 1
Our results suggest that a moderate amount of regularization (C = 1) balances the bias-variance tradeoff: it avoids the overfitting that larger C values (e.g., 100 or 1000) could cause, and the underfitting that too small a C could cause.
Optimal Model Fit¶
Let's refit the model with the best C value.
Classification Metrics Using K-fold CV¶
# Initialize the SVM model with polynomial kernel
final_svm_poly = SVC(kernel='poly', C=best_C, random_state=1)
# Set up Stratified K-Fold cross-validation (e.g., 5 folds)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# Perform cross-validation to evaluate performance on the training data
cv_accuracy_scores = cross_val_score(final_svm_poly, x_train, y_train, cv=skf, scoring='accuracy')
cv_precision_scores = cross_val_score(final_svm_poly, x_train, y_train, cv=skf, scoring='precision')
cv_recall_scores = cross_val_score(final_svm_poly, x_train, y_train, cv=skf, scoring='recall')
cv_f1_scores = cross_val_score(final_svm_poly, x_train, y_train, cv=skf, scoring='f1')
# Calculate mean CV scores
cv_accuracy = cv_accuracy_scores.mean()
cv_precision = cv_precision_scores.mean()
cv_recall = cv_recall_scores.mean()
cv_f1 = cv_f1_scores.mean()
# Now, fit the model on the full training data
final_svm_poly.fit(x_train, y_train)
# Predict on the test data
y_test_pred_best = final_svm_poly.predict(x_test)
# Calculate test set metrics
accuracy_test = accuracy_score(y_test, y_test_pred_best)
precision_test = precision_score(y_test, y_test_pred_best, zero_division=0)
recall_test = recall_score(y_test, y_test_pred_best)
f1_test = f1_score(y_test, y_test_pred_best)
# Creating a DataFrame to compare metrics between cross-validated train set and test set
svm_metrics_df = pd.DataFrame({
'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
'SVM Train': [cv_accuracy, cv_precision, cv_recall, cv_f1],
'SVM Test': [accuracy_test, precision_test, recall_test, f1_test]
})
# Display the DataFrame with the performance metrics
print("Performance Metrics of SVM with Polynomial Kernel (Cross-Validated):")
svm_metrics_df
Performance Metrics of SVM with Polynomial Kernel (Cross-Validated):
| | Metric | SVM Train | SVM Test |
|---|---|---|---|
| 0 | Accuracy | 0.630454 | 0.631857 |
| 1 | Precision | 0.624090 | 0.625840 |
| 2 | Recall | 0.942065 | 0.937360 |
| 3 | F1 Score | 0.750757 | 0.750560 |
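As a side note, the four separate `cross_val_score` calls above each refit the model per fold; `cross_validate` can compute all four metrics in a single pass. A minimal sketch on synthetic data (standing in for `x_train`/`y_train`):

```python
# Computing several CV metrics in one pass with cross_validate
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

scores = cross_validate(SVC(kernel='poly', C=1, random_state=1), X, y,
                        cv=skf,
                        scoring=['accuracy', 'precision', 'recall', 'f1'])
# cross_validate returns arrays keyed as 'test_<metric>'
means = {m: scores[f'test_{m}'].mean()
         for m in ['accuracy', 'precision', 'recall', 'f1']}
print(means)
```

This cuts the number of fits from 4 × k to k, which matters for slower kernels like the polynomial one used here.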
Confusion Matrix¶
# Print the class assignment for SEGMENT_1
print(f"Class mapping for 'SEGMENT_1': {le.classes_[0]} -> 0, {le.classes_[1]} -> 1")
# We already have the 'confusion_matrix_with_counts_and_percentage' function from earlier.
# Call the function to display the confusion matrix for SVM
confusion_matrix_with_counts_and_percentage(final_svm_poly, x_test, y_test)
Class mapping for 'SEGMENT_1': Core -> 0, Up -> 1
array([[118, 501],
[ 56, 838]], dtype=int64)
Feature/Permutation Importance¶
To interpret feature importance for a polynomial-kernel SVM, we use permutation importance, which shuffles the values of each feature in turn and observes how the model's performance changes. The greater the performance drop when a feature is shuffled, the more important that feature is.
from sklearn.inspection import permutation_importance
# 'final_svm_poly' is the trained SVM model with polynomial kernel
result = permutation_importance(final_svm_poly, x_test, y_test, n_repeats=10, random_state=1)
# Create a DataFrame for feature importances
perm_importances = pd.DataFrame({
'Feature': x_train.columns, # Feature names
'Importance': result.importances_mean # Mean importance scores
})
# Sort by importance for better readability
perm_importances = perm_importances.sort_values(by='Importance', ascending=False).reset_index(drop=True)
# Set pandas display options to show decimals instead of scientific notation
pd.set_option('display.float_format', '{:.4f}'.format)
# Plot feature importances
plt.figure(figsize=(12, 8))
bars = plt.barh(perm_importances['Feature'], perm_importances['Importance'], color='skyblue')
plt.gca().invert_yaxis()
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.title('Feature Importances (Permutation Importance)')
plt.show()
# Display the DataFrame with importances
perm_importances
| | Feature | Importance |
|---|---|---|
| 0 | LOYALTY_GROUP_Ocasional | 0.0402 |
| 1 | CUMSALES | 0.0344 |
| 2 | CONSISTENCY | 0.0250 |
| 3 | FREQUENCY | 0.0238 |
| 4 | LOYALTY_GROUP_Loyal | 0.0206 |
| 5 | LOYALTY_GROUP_Split | 0.0199 |
| 6 | AVERAGE_TICKET | 0.0170 |
| 7 | LOYALTY_GROUP_Vip | 0.0089 |
| 8 | GENDER_Female | 0.0057 |
| 9 | MARITAL_STATUS_Single | 0.0039 |
| 10 | AGE | 0.0035 |
| 11 | RECENCY | 0.0035 |
| 12 | GENDER_Male | 0.0034 |
| 13 | MOSTUSED_PLATFORM_Mobile | 0.0019 |
| 14 | PRICE_GROUP_Very Price Insensitive | 0.0018 |
| 15 | MARITAL_STATUS_Married | 0.0014 |
| 16 | PRICE_GROUP_Selective Price Sensitive | 0.0005 |
| 17 | PRICE_GROUP_Very Price Sensitive | 0.0003 |
| 18 | PRICE_GROUP_Moderately Price Insensitive | 0.0002 |
| 19 | PRICE_GROUP_Moderately Price Sensitive | -0.0000 |
| 20 | MARITAL_STATUS_Divorced | -0.0001 |
| 21 | MOSTUSED_PLATFORM_Web | -0.0010 |
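The mean importances above come with fold-to-fold variability across the 10 shuffles; `result.importances_std` exposes it, which helps judge whether the near-zero entries at the bottom of the ranking are distinguishable from noise. A minimal sketch, with synthetic data and a fresh `SVC` standing in for the notebook's `final_svm_poly`/`x_test`/`y_test`:

```python
# Adding the spread across shuffles to a permutation-importance table
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC
import pandas as pd

X, y = make_classification(n_samples=200, n_features=5, random_state=1)
model = SVC(kernel='poly', C=1, random_state=1).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=1)
imp = pd.DataFrame({
    'Feature': [f'f{i}' for i in range(X.shape[1])],  # placeholder names
    'Importance': result.importances_mean,
    'Std': result.importances_std,  # spread across the 10 shuffles
}).sort_values('Importance', ascending=False).reset_index(drop=True)
print(imp)
```

A feature whose mean importance is smaller than its standard deviation (as with the negative entries above) is effectively indistinguishable from no importance.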
3.2.4 Neural Networks¶
! pip install tensorflow
Requirement already satisfied: tensorflow in c:\users\palad\anaconda3\lib\site-packages (2.17.0) Requirement already satisfied: tensorflow-intel==2.17.0 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow) (2.17.0) Requirement already satisfied: absl-py>=1.0.0 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (2.1.0) Requirement already satisfied: astunparse>=1.6.0 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (1.6.3) Requirement already satisfied: flatbuffers>=24.3.25 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (24.3.25) Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (0.6.0) Requirement already satisfied: google-pasta>=0.1.1 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (0.2.0) Requirement already satisfied: h5py>=3.10.0 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (3.12.1) Requirement already satisfied: libclang>=13.0.0 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (18.1.1) Requirement already satisfied: ml-dtypes<0.5.0,>=0.3.1 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (0.4.1) Requirement already satisfied: opt-einsum>=2.3.2 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (3.4.0) Requirement already satisfied: packaging in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (23.1) Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (3.20.3) Requirement already satisfied: requests<3,>=2.21.0 in 
c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (2.31.0) Requirement already satisfied: setuptools in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (68.2.2) Requirement already satisfied: six>=1.12.0 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (1.16.0) Requirement already satisfied: termcolor>=1.1.0 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (2.5.0) Requirement already satisfied: typing-extensions>=3.6.6 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (4.9.0) Requirement already satisfied: wrapt>=1.11.0 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (1.14.1) Requirement already satisfied: grpcio<2.0,>=1.24.3 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (1.66.2) Requirement already satisfied: tensorboard<2.18,>=2.17 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (2.17.1) Requirement already satisfied: keras>=3.2.0 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (3.6.0) Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (0.31.0) Requirement already satisfied: numpy<2.0.0,>=1.23.5 in c:\users\palad\anaconda3\lib\site-packages (from tensorflow-intel==2.17.0->tensorflow) (1.26.4) Requirement already satisfied: wheel<1.0,>=0.23.0 in c:\users\palad\anaconda3\lib\site-packages (from astunparse>=1.6.0->tensorflow-intel==2.17.0->tensorflow) (0.41.2) Requirement already satisfied: rich in c:\users\palad\anaconda3\lib\site-packages (from keras>=3.2.0->tensorflow-intel==2.17.0->tensorflow) (13.3.5) Requirement already satisfied: namex in c:\users\palad\anaconda3\lib\site-packages (from 
keras>=3.2.0->tensorflow-intel==2.17.0->tensorflow) (0.0.8) Requirement already satisfied: optree in c:\users\palad\anaconda3\lib\site-packages (from keras>=3.2.0->tensorflow-intel==2.17.0->tensorflow) (0.13.0) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\palad\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorflow-intel==2.17.0->tensorflow) (2.0.4) Requirement already satisfied: idna<4,>=2.5 in c:\users\palad\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorflow-intel==2.17.0->tensorflow) (3.4) Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\palad\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorflow-intel==2.17.0->tensorflow) (2.0.7) Requirement already satisfied: certifi>=2017.4.17 in c:\users\palad\anaconda3\lib\site-packages (from requests<3,>=2.21.0->tensorflow-intel==2.17.0->tensorflow) (2024.7.4) Requirement already satisfied: markdown>=2.6.8 in c:\users\palad\anaconda3\lib\site-packages (from tensorboard<2.18,>=2.17->tensorflow-intel==2.17.0->tensorflow) (3.4.1) Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in c:\users\palad\anaconda3\lib\site-packages (from tensorboard<2.18,>=2.17->tensorflow-intel==2.17.0->tensorflow) (0.7.2) Requirement already satisfied: werkzeug>=1.0.1 in c:\users\palad\anaconda3\lib\site-packages (from tensorboard<2.18,>=2.17->tensorflow-intel==2.17.0->tensorflow) (2.2.3) Requirement already satisfied: MarkupSafe>=2.1.1 in c:\users\palad\anaconda3\lib\site-packages (from werkzeug>=1.0.1->tensorboard<2.18,>=2.17->tensorflow-intel==2.17.0->tensorflow) (2.1.3) Requirement already satisfied: markdown-it-py<3.0.0,>=2.2.0 in c:\users\palad\anaconda3\lib\site-packages (from rich->keras>=3.2.0->tensorflow-intel==2.17.0->tensorflow) (2.2.0) Requirement already satisfied: pygments<3.0.0,>=2.13.0 in c:\users\palad\anaconda3\lib\site-packages (from rich->keras>=3.2.0->tensorflow-intel==2.17.0->tensorflow) (2.15.1) Requirement already satisfied: 
mdurl~=0.1 in c:\users\palad\anaconda3\lib\site-packages (from markdown-it-py<3.0.0,>=2.2.0->rich->keras>=3.2.0->tensorflow-intel==2.17.0->tensorflow) (0.1.0)
# Import modules and classes from TensorFlow for constructing the NN
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
# Other modules and classes
import random
from tensorflow.keras import backend
import warnings
# Ignore specific UserWarnings (like the one from Keras)
warnings.filterwarnings("ignore", category=UserWarning, module='keras')
Baseline Model Fit & Results¶
This neural network is designed for binary classification. The input layer feeds a single hidden layer containing twice as many neurons as there are input features, using ReLU activation for non-linearity. The output layer has one neuron with a sigmoid activation function, producing a probability for the positive class.
The model is compiled with the Adam optimizer for efficient optimization and binary cross-entropy as the loss function, commonly used for binary classification. Accuracy is used as the metric to evaluate the model's performance during training.
# Create the model
model = Sequential()
# Add the input layer and first hidden layer
num_input_features = x_train.shape[1]
num_hidden_neurons = 2 * num_input_features # Architecture choice: 2x the number of input features
model.add(Dense(num_hidden_neurons, activation='relu', input_shape=(num_input_features,)))
# Add the output layer (sigmoid for binary classification)
model.add(Dense(1, activation='sigmoid'))
# Compile the model
model.compile(
optimizer='adam', # Adam optimizer
loss='binary_crossentropy', # Loss function for binary classification
metrics=['accuracy'] # Metrics to track during training
)
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ dense (Dense) │ (None, 44) │ 1,012 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_1 (Dense) │ (None, 1) │ 45 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 1,057 (4.13 KB)
Trainable params: 1,057 (4.13 KB)
Non-trainable params: 0 (0.00 B)
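The parameter counts in the summary follow directly from the layer sizes: the permutation-importance table above lists 22 features, so the hidden layer has 2 × 22 = 44 neurons. A quick check of the arithmetic:

```python
# Verify the Dense layer parameter counts reported by model.summary()
n_in, n_hidden = 22, 44                  # 22 input features, 2x hidden neurons
hidden_params = (n_in + 1) * n_hidden    # weights + biases per hidden neuron -> 1,012
output_params = (n_hidden + 1) * 1       # weights + bias for the single output -> 45
print(hidden_params, output_params, hidden_params + output_params)
```

That reproduces the 1,012 + 45 = 1,057 trainable parameters in the summary.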
I chose 200 epochs to give the model enough time to learn while monitoring validation performance for early stopping. A batch size of 32 balances training speed and resource efficiency. The 20% validation split lets me track performance on unseen data during training without touching the held-out test set.
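The early stopping mentioned here is monitored manually; if automated, it could be wired in with Keras's `EarlyStopping` callback. A hedged sketch (the `patience` value is an illustrative choice, not one used in this notebook):

```python
# Sketch: automating early stopping with a Keras callback
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',           # watch the validation loss
    patience=20,                  # stop after 20 epochs with no improvement
    restore_best_weights=True,    # roll back to the best epoch's weights
)
# It would then be passed to model.fit via callbacks=[early_stop].
```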
# Train the baseline model with full train data (x_train, y_train)
baseline_history = model.fit(
x_train, # Full training data
y_train,
epochs=200,
batch_size=32,
validation_split=0.2, # Use 20% of the training data as validation
verbose=1
)
Epoch 1/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.5356 - loss: 0.6938 - val_accuracy: 0.6149 - val_loss: 0.6594 Epoch 2/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 620us/step - accuracy: 0.5941 - loss: 0.6687 - val_accuracy: 0.6273 - val_loss: 0.6487 Epoch 3/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 629us/step - accuracy: 0.6269 - loss: 0.6492 - val_accuracy: 0.6223 - val_loss: 0.6424 Epoch 4/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 636us/step - accuracy: 0.6190 - loss: 0.6479 - val_accuracy: 0.6306 - val_loss: 0.6382 Epoch 5/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 599us/step - accuracy: 0.6163 - loss: 0.6443 - val_accuracy: 0.6248 - val_loss: 0.6366 Epoch 6/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 621us/step - accuracy: 0.6232 - loss: 0.6379 - val_accuracy: 0.6314 - val_loss: 0.6330 Epoch 7/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 600us/step - accuracy: 0.6217 - loss: 0.6372 - val_accuracy: 0.6364 - val_loss: 0.6330 Epoch 8/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 630us/step - accuracy: 0.6180 - loss: 0.6397 - val_accuracy: 0.6289 - val_loss: 0.6302 Epoch 9/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 612us/step - accuracy: 0.6310 - loss: 0.6344 - val_accuracy: 0.6289 - val_loss: 0.6302 Epoch 10/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 586us/step - accuracy: 0.6344 - loss: 0.6294 - val_accuracy: 0.6256 - val_loss: 0.6292 Epoch 11/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 612us/step - accuracy: 0.6221 - loss: 0.6267 - val_accuracy: 0.6174 - val_loss: 0.6268 Epoch 12/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 597us/step - accuracy: 0.6374 - loss: 0.6230 - val_accuracy: 0.6174 - val_loss: 0.6304 Epoch 13/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 627us/step - accuracy: 0.6316 - loss: 0.6243 - val_accuracy: 0.6281 - val_loss: 0.6241 Epoch 14/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 611us/step - accuracy: 0.6271 - loss: 0.6276 - val_accuracy: 0.6240 - val_loss: 0.6236 Epoch 15/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 610us/step - accuracy: 0.6363 - loss: 0.6222 - val_accuracy: 0.6215 - val_loss: 0.6245 Epoch 16/200 152/152 
━━━━━━━━━━━━━━━━━━━━ 0s 592us/step - accuracy: 0.6355 - loss: 0.6189 - val_accuracy: 0.6273 - val_loss: 0.6228 Epoch 17/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 610us/step - accuracy: 0.6330 - loss: 0.6213 - val_accuracy: 0.6223 - val_loss: 0.6232 Epoch 18/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 619us/step - accuracy: 0.6427 - loss: 0.6151 - val_accuracy: 0.6207 - val_loss: 0.6254 Epoch 19/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 593us/step - accuracy: 0.6250 - loss: 0.6174 - val_accuracy: 0.6298 - val_loss: 0.6217 Epoch 20/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 732us/step - accuracy: 0.6354 - loss: 0.6118 - val_accuracy: 0.6264 - val_loss: 0.6214 Epoch 21/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 762us/step - accuracy: 0.6392 - loss: 0.6145 - val_accuracy: 0.6331 - val_loss: 0.6222 Epoch 22/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 637us/step - accuracy: 0.6401 - loss: 0.6083 - val_accuracy: 0.6289 - val_loss: 0.6221 Epoch 23/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 587us/step - accuracy: 0.6426 - loss: 0.6066 - val_accuracy: 0.6314 - val_loss: 0.6191 Epoch 24/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 614us/step - accuracy: 0.6438 - loss: 0.6126 - val_accuracy: 0.6331 - val_loss: 0.6195 Epoch 25/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 634us/step - accuracy: 0.6533 - loss: 0.5979 - val_accuracy: 0.6190 - val_loss: 0.6227 Epoch 26/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 620us/step - accuracy: 0.6412 - loss: 0.6075 - val_accuracy: 0.6314 - val_loss: 0.6174 Epoch 27/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 634us/step - accuracy: 0.6480 - loss: 0.6069 - val_accuracy: 0.6314 - val_loss: 0.6199 Epoch 28/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 603us/step - accuracy: 0.6561 - loss: 0.6037 - val_accuracy: 0.6116 - val_loss: 0.6242 Epoch 29/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 594us/step - accuracy: 0.6523 - loss: 0.5946 - val_accuracy: 0.6240 - val_loss: 0.6227 Epoch 30/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 627us/step - accuracy: 0.6482 - loss: 0.6043 - val_accuracy: 0.6264 - val_loss: 0.6192 Epoch 31/200 152/152 
━━━━━━━━━━━━━━━━━━━━ 0s 647us/step - accuracy: 0.6610 - loss: 0.5983 - val_accuracy: 0.6248 - val_loss: 0.6210 Epoch 32/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 729us/step - accuracy: 0.6449 - loss: 0.6081 - val_accuracy: 0.6322 - val_loss: 0.6186 Epoch 33/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 648us/step - accuracy: 0.6604 - loss: 0.5979 - val_accuracy: 0.6264 - val_loss: 0.6193 Epoch 34/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 673us/step - accuracy: 0.6613 - loss: 0.5957 - val_accuracy: 0.6322 - val_loss: 0.6191 Epoch 35/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 680us/step - accuracy: 0.6435 - loss: 0.6047 - val_accuracy: 0.6306 - val_loss: 0.6186 Epoch 36/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 693us/step - accuracy: 0.6394 - loss: 0.6060 - val_accuracy: 0.6339 - val_loss: 0.6182 Epoch 37/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 720us/step - accuracy: 0.6569 - loss: 0.5939 - val_accuracy: 0.6182 - val_loss: 0.6237 Epoch 38/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 693us/step - accuracy: 0.6536 - loss: 0.5990 - val_accuracy: 0.6273 - val_loss: 0.6205 Epoch 39/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 673us/step - accuracy: 0.6665 - loss: 0.5893 - val_accuracy: 0.6223 - val_loss: 0.6204 Epoch 40/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 686us/step - accuracy: 0.6595 - loss: 0.5974 - val_accuracy: 0.6314 - val_loss: 0.6197 Epoch 41/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 667us/step - accuracy: 0.6581 - loss: 0.5945 - val_accuracy: 0.6314 - val_loss: 0.6222 Epoch 42/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 667us/step - accuracy: 0.6601 - loss: 0.5940 - val_accuracy: 0.6165 - val_loss: 0.6212 Epoch 43/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 653us/step - accuracy: 0.6571 - loss: 0.5900 - val_accuracy: 0.6182 - val_loss: 0.6270 Epoch 44/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 673us/step - accuracy: 0.6566 - loss: 0.5918 - val_accuracy: 0.6372 - val_loss: 0.6235 Epoch 45/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 660us/step - accuracy: 0.6589 - loss: 0.5928 - val_accuracy: 0.6281 - val_loss: 0.6242 Epoch 46/200 152/152 
━━━━━━━━━━━━━━━━━━━━ 0s 686us/step - accuracy: 0.6673 - loss: 0.5879 - val_accuracy: 0.6165 - val_loss: 0.6199 Epoch 47/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 686us/step - accuracy: 0.6572 - loss: 0.5947 - val_accuracy: 0.6273 - val_loss: 0.6226 Epoch 48/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 653us/step - accuracy: 0.6626 - loss: 0.5911 - val_accuracy: 0.6264 - val_loss: 0.6199 Epoch 49/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 728us/step - accuracy: 0.6603 - loss: 0.5954 - val_accuracy: 0.6298 - val_loss: 0.6207 Epoch 50/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 686us/step - accuracy: 0.6624 - loss: 0.5855 - val_accuracy: 0.6314 - val_loss: 0.6187 Epoch 51/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 678us/step - accuracy: 0.6686 - loss: 0.5861 - val_accuracy: 0.6256 - val_loss: 0.6175 Epoch 52/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 692us/step - accuracy: 0.6569 - loss: 0.5912 - val_accuracy: 0.6347 - val_loss: 0.6188 Epoch 53/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 653us/step - accuracy: 0.6671 - loss: 0.5860 - val_accuracy: 0.6281 - val_loss: 0.6206 Epoch 54/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 680us/step - accuracy: 0.6645 - loss: 0.5878 - val_accuracy: 0.6190 - val_loss: 0.6240 Epoch 55/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 719us/step - accuracy: 0.6603 - loss: 0.5829 - val_accuracy: 0.6165 - val_loss: 0.6202 Epoch 56/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 687us/step - accuracy: 0.6690 - loss: 0.5860 - val_accuracy: 0.6207 - val_loss: 0.6198 Epoch 57/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 706us/step - accuracy: 0.6526 - loss: 0.5904 - val_accuracy: 0.6157 - val_loss: 0.6236 Epoch 58/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 713us/step - accuracy: 0.6652 - loss: 0.5842 - val_accuracy: 0.6207 - val_loss: 0.6202 Epoch 59/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 726us/step - accuracy: 0.6760 - loss: 0.5776 - val_accuracy: 0.6281 - val_loss: 0.6202 Epoch 60/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 757us/step - accuracy: 0.6824 - loss: 0.5799 - val_accuracy: 0.6273 - val_loss: 0.6233 Epoch 61/200 152/152 
━━━━━━━━━━━━━━━━━━━━ 0s 762us/step - accuracy: 0.6676 - loss: 0.5843 - val_accuracy: 0.6256 - val_loss: 0.6184 Epoch 62/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 808us/step - accuracy: 0.6639 - loss: 0.5853 - val_accuracy: 0.6256 - val_loss: 0.6184 Epoch 63/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 749us/step - accuracy: 0.6620 - loss: 0.5869 - val_accuracy: 0.6339 - val_loss: 0.6199 Epoch 64/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 784us/step - accuracy: 0.6624 - loss: 0.5875 - val_accuracy: 0.6281 - val_loss: 0.6190 Epoch 65/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 782us/step - accuracy: 0.6722 - loss: 0.5768 - val_accuracy: 0.6182 - val_loss: 0.6219 Epoch 66/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 750us/step - accuracy: 0.6726 - loss: 0.5736 - val_accuracy: 0.6273 - val_loss: 0.6242 Epoch 67/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 785us/step - accuracy: 0.6788 - loss: 0.5737 - val_accuracy: 0.6174 - val_loss: 0.6225 Epoch 68/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 787us/step - accuracy: 0.6625 - loss: 0.5801 - val_accuracy: 0.6223 - val_loss: 0.6205 Epoch 69/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 797us/step - accuracy: 0.6726 - loss: 0.5764 - val_accuracy: 0.6264 - val_loss: 0.6192 Epoch 70/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 833us/step - accuracy: 0.6712 - loss: 0.5793 - val_accuracy: 0.6248 - val_loss: 0.6220 Epoch 71/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 811us/step - accuracy: 0.6632 - loss: 0.5844 - val_accuracy: 0.6281 - val_loss: 0.6200 Epoch 72/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 778us/step - accuracy: 0.6814 - loss: 0.5719 - val_accuracy: 0.6248 - val_loss: 0.6224 Epoch 73/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 697us/step - accuracy: 0.6692 - loss: 0.5776 - val_accuracy: 0.6306 - val_loss: 0.6188 Epoch 74/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 733us/step - accuracy: 0.6831 - loss: 0.5719 - val_accuracy: 0.6140 - val_loss: 0.6228 Epoch 75/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 689us/step - accuracy: 0.6740 - loss: 0.5769 - val_accuracy: 0.6248 - val_loss: 0.6199 Epoch 76/200 152/152 
━━━━━━━━━━━━━━━━━━━━ 0s 767us/step - accuracy: 0.6719 - loss: 0.5758 - val_accuracy: 0.6215 - val_loss: 0.6224 Epoch 77/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 718us/step - accuracy: 0.6807 - loss: 0.5781 - val_accuracy: 0.6231 - val_loss: 0.6219 Epoch 78/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 731us/step - accuracy: 0.6776 - loss: 0.5796 - val_accuracy: 0.6174 - val_loss: 0.6236 Epoch 79/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 742us/step - accuracy: 0.6697 - loss: 0.5818 - val_accuracy: 0.6264 - val_loss: 0.6199 Epoch 80/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 756us/step - accuracy: 0.6648 - loss: 0.5775 - val_accuracy: 0.6289 - val_loss: 0.6221 Epoch 81/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 770us/step - accuracy: 0.6812 - loss: 0.5733 - val_accuracy: 0.6264 - val_loss: 0.6283 Epoch 82/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 769us/step - accuracy: 0.6837 - loss: 0.5677 - val_accuracy: 0.6281 - val_loss: 0.6233 Epoch 83/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 865us/step - accuracy: 0.6765 - loss: 0.5725 - val_accuracy: 0.6190 - val_loss: 0.6199 Epoch 84/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 795us/step - accuracy: 0.6730 - loss: 0.5773 - val_accuracy: 0.6264 - val_loss: 0.6248 Epoch 85/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 804us/step - accuracy: 0.6762 - loss: 0.5791 - val_accuracy: 0.6240 - val_loss: 0.6231 Epoch 86/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 783us/step - accuracy: 0.6815 - loss: 0.5701 - val_accuracy: 0.6322 - val_loss: 0.6253 Epoch 87/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 735us/step - accuracy: 0.6769 - loss: 0.5726 - val_accuracy: 0.6215 - val_loss: 0.6258 Epoch 88/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 746us/step - accuracy: 0.6564 - loss: 0.5786 - val_accuracy: 0.6298 - val_loss: 0.6219 Epoch 89/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 688us/step - accuracy: 0.6772 - loss: 0.5734 - val_accuracy: 0.6207 - val_loss: 0.6220 Epoch 90/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 788us/step - accuracy: 0.6812 - loss: 0.5682 - val_accuracy: 0.6190 - val_loss: 0.6242 Epoch 91/200 152/152 
━━━━━━━━━━━━━━━━━━━━ 0s 735us/step - accuracy: 0.6790 - loss: 0.5758 - val_accuracy: 0.6322 - val_loss: 0.6216 Epoch 92/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 740us/step - accuracy: 0.6758 - loss: 0.5691 - val_accuracy: 0.6331 - val_loss: 0.6208 Epoch 93/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 781us/step - accuracy: 0.6673 - loss: 0.5787 - val_accuracy: 0.6289 - val_loss: 0.6245 Epoch 94/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 800us/step - accuracy: 0.6612 - loss: 0.5784 - val_accuracy: 0.6223 - val_loss: 0.6262 Epoch 95/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 766us/step - accuracy: 0.6664 - loss: 0.5757 - val_accuracy: 0.6215 - val_loss: 0.6261 Epoch 96/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 790us/step - accuracy: 0.6719 - loss: 0.5765 - val_accuracy: 0.6355 - val_loss: 0.6211 Epoch 97/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 752us/step - accuracy: 0.6735 - loss: 0.5765 - val_accuracy: 0.6223 - val_loss: 0.6309 Epoch 98/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 749us/step - accuracy: 0.6858 - loss: 0.5675 - val_accuracy: 0.6306 - val_loss: 0.6227 Epoch 99/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 759us/step - accuracy: 0.6818 - loss: 0.5680 - val_accuracy: 0.6240 - val_loss: 0.6226 Epoch 100/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 700us/step - accuracy: 0.6753 - loss: 0.5686 - val_accuracy: 0.6298 - val_loss: 0.6249 Epoch 101/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 867us/step - accuracy: 0.6876 - loss: 0.5710 - val_accuracy: 0.5992 - val_loss: 0.6325 Epoch 102/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 785us/step - accuracy: 0.6808 - loss: 0.5658 - val_accuracy: 0.6149 - val_loss: 0.6245 Epoch 103/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 746us/step - accuracy: 0.6805 - loss: 0.5683 - val_accuracy: 0.6289 - val_loss: 0.6213 Epoch 104/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 848us/step - accuracy: 0.6701 - loss: 0.5695 - val_accuracy: 0.6273 - val_loss: 0.6224 Epoch 105/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 845us/step - accuracy: 0.6855 - loss: 0.5704 - val_accuracy: 0.6165 - val_loss: 0.6272 Epoch 106/200 152/152 
━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.6846 - loss: 0.5690 - val_accuracy: 0.6289 - val_loss: 0.6241 Epoch 107/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.6861 - loss: 0.5577 - val_accuracy: 0.6207 - val_loss: 0.6251 Epoch 108/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 858us/step - accuracy: 0.6790 - loss: 0.5667 - val_accuracy: 0.6281 - val_loss: 0.6246 Epoch 109/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 772us/step - accuracy: 0.6758 - loss: 0.5698 - val_accuracy: 0.6256 - val_loss: 0.6227 Epoch 110/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 746us/step - accuracy: 0.6720 - loss: 0.5691 - val_accuracy: 0.6207 - val_loss: 0.6235 Epoch 111/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 739us/step - accuracy: 0.6907 - loss: 0.5657 - val_accuracy: 0.6116 - val_loss: 0.6301 Epoch 112/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 752us/step - accuracy: 0.6667 - loss: 0.5744 - val_accuracy: 0.6298 - val_loss: 0.6215 Epoch 113/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 752us/step - accuracy: 0.6841 - loss: 0.5644 - val_accuracy: 0.6240 - val_loss: 0.6225 Epoch 114/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 766us/step - accuracy: 0.6726 - loss: 0.5763 - val_accuracy: 0.6355 - val_loss: 0.6229 Epoch 115/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 786us/step - accuracy: 0.6943 - loss: 0.5603 - val_accuracy: 0.6240 - val_loss: 0.6229 Epoch 116/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6872 - loss: 0.5609 - val_accuracy: 0.6207 - val_loss: 0.6225 Epoch 117/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.6923 - loss: 0.5645 - val_accuracy: 0.6223 - val_loss: 0.6235 Epoch 118/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 663us/step - accuracy: 0.6792 - loss: 0.5685 - val_accuracy: 0.6231 - val_loss: 0.6255 Epoch 119/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 799us/step - accuracy: 0.6702 - loss: 0.5680 - val_accuracy: 0.6256 - val_loss: 0.6233 Epoch 120/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 629us/step - accuracy: 0.6801 - loss: 0.5602 - val_accuracy: 0.6264 - val_loss: 0.6229 Epoch 121/200 152/152 
━━━━━━━━━━━━━━━━━━━━ 0s 898us/step - accuracy: 0.6860 - loss: 0.5625 - val_accuracy: 0.6273 - val_loss: 0.6217 Epoch 122/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 851us/step - accuracy: 0.6860 - loss: 0.5592 - val_accuracy: 0.6223 - val_loss: 0.6269 Epoch 123/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 851us/step - accuracy: 0.6784 - loss: 0.5652 - val_accuracy: 0.6017 - val_loss: 0.6303 Epoch 124/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 818us/step - accuracy: 0.6959 - loss: 0.5601 - val_accuracy: 0.6248 - val_loss: 0.6218 Epoch 125/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 779us/step - accuracy: 0.6838 - loss: 0.5580 - val_accuracy: 0.6289 - val_loss: 0.6225 Epoch 126/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 805us/step - accuracy: 0.6848 - loss: 0.5616 - val_accuracy: 0.6215 - val_loss: 0.6261 Epoch 127/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 805us/step - accuracy: 0.6859 - loss: 0.5645 - val_accuracy: 0.6264 - val_loss: 0.6244 Epoch 128/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 805us/step - accuracy: 0.6940 - loss: 0.5580 - val_accuracy: 0.6190 - val_loss: 0.6257 Epoch 129/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 799us/step - accuracy: 0.6816 - loss: 0.5696 - val_accuracy: 0.6157 - val_loss: 0.6242 Epoch 130/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 811us/step - accuracy: 0.6856 - loss: 0.5616 - val_accuracy: 0.6322 - val_loss: 0.6274 Epoch 131/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 805us/step - accuracy: 0.6962 - loss: 0.5590 - val_accuracy: 0.6215 - val_loss: 0.6252 Epoch 132/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 779us/step - accuracy: 0.6797 - loss: 0.5653 - val_accuracy: 0.6174 - val_loss: 0.6261 Epoch 133/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 772us/step - accuracy: 0.6904 - loss: 0.5642 - val_accuracy: 0.6240 - val_loss: 0.6264 Epoch 134/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 766us/step - accuracy: 0.6826 - loss: 0.5643 - val_accuracy: 0.6190 - val_loss: 0.6248 Epoch 135/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6660 - loss: 0.5731 - val_accuracy: 0.6207 - val_loss: 0.6264 Epoch 136/200 152/152 
━━━━━━━━━━━━━━━━━━━━ 0s 693us/step - accuracy: 0.6814 - loss: 0.5622 - val_accuracy: 0.6190 - val_loss: 0.6310
[... Epochs 137-199 truncated: training accuracy drifts up from ~0.68 to ~0.71 while validation accuracy plateaus around 0.60-0.63 and val_loss hovers near 0.63 ...]
Epoch 200/200 152/152 ━━━━━━━━━━━━━━━━━━━━ 0s 889us/step - accuracy: 0.6922 - loss: 0.5577 - val_accuracy: 0.6124 - val_loss: 0.6342
# Plot training and validation accuracy over epochs
plt.plot(baseline_history.history['accuracy'], label='Training Accuracy')
plt.plot(baseline_history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Epochs')
plt.legend()
plt.grid(True) # Add grid for better readability
plt.show()
# Evaluate the baseline model on the full test set
baseline_loss, baseline_accuracy = model.evaluate(x_test, y_test, verbose=1)
print(f'Baseline Test Loss: {baseline_loss:.4f}')
print(f'Baseline Test Accuracy: {baseline_accuracy:.4f}')
48/48 ━━━━━━━━━━━━━━━━━━━━ 0s 700us/step - accuracy: 0.5681 - loss: 0.6714 Baseline Test Loss: 0.6681 Baseline Test Accuracy: 0.5849
The gap between training accuracy (~70%) and validation accuracy (~61-62%, with test accuracy at 58.5%) indicates overfitting; we need to tune hyperparameters to find a model with more consistent performance.
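One standard remedy, not applied in this notebook, is early stopping (e.g. Keras's `tf.keras.callbacks.EarlyStopping` with `patience` and `restore_best_weights`), which halts training once validation loss stops improving for a set number of epochs. The core patience logic, as a framework-free sketch:

```python
def early_stop_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop,
    i.e. `patience` epochs after the last improvement in val loss."""
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses)  # never triggered: train to the end

# Validation loss improves for three epochs, then stalls:
losses = [0.70, 0.65, 0.63, 0.64, 0.66, 0.65, 0.67, 0.66, 0.68]
print(early_stop_epoch(losses, patience=3))  # → 6 (3 epochs after the best, epoch 3)
```

With the flat validation curve seen above, this would have cut training well short of 200 epochs.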
Subset for Tuning¶
# Fixing the seed for random number generators for reproducibility
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)
# Step 1: Subset data (50%) from the original train data (x_train and y_train remain the same)
x_train_subset, _, y_train_subset, _ = train_test_split(x_train, y_train, test_size=0.5, stratify=y_train, random_state=1)
# Step 2: Further split this subset into a training set and test set for tuning
x_train_subset_from_train, x_test_subset_from_train, y_train_subset_from_train, y_test_subset_from_train = train_test_split(
x_train_subset, y_train_subset, test_size=0.3, stratify=y_train_subset, random_state=1)
# Verify the shape of the subset
print(f"Original Train size (which is 80% of main data): {x_train.shape[0]} records")
print(f"Subset size taken from training set (which is 50% of train data): {x_train_subset.shape[0]} records")
print(f"Train Subset size (which is 70% of subset): {x_train_subset_from_train.shape[0]} records")
print(f"Test Subset size (which is 30% of subset): {x_test_subset_from_train.shape[0]} records")
Original Train size (which is 80% of main data): 6048 records Subset size taken from training set (which is 50% of train data): 3024 records Train Subset size (which is 70% of subset): 2116 records Test Subset size (which is 30% of subset): 908 records
Subsetting the Training Data for Hyperparameter Tuning: A subset of the training data (50%) was selected for hyperparameter tuning. Because neural-network training is time-consuming, this reduces computational overhead and makes tuning faster without sacrificing model quality. Stratification was applied here as well to maintain class balance in the subset, so performance during tuning is representative of the overall data distribution.
Further Split for Tuning: The subset is then split into a 70% tuning-train portion (on which K-fold cross-validation is run) and a 30% held-out portion used to check the tuned model. This keeps hyperparameter tuning in a controlled environment with sets representative of the overall data, maintaining the integrity of model evaluation during tuning.
Final Model Training and Evaluation: Once the best hyperparameters are found using the subset, the final model is trained on the full training set (x_train, y_train) and evaluated on the test set (x_test, y_test), so the final evaluation reflects performance on the full data.
Note: the original test set is never touched during tuning, so it remains valid for reporting unbiased final model performance at the end.
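The printed subset sizes follow directly from the split fractions; a quick arithmetic check (assuming scikit-learn's convention of allocating `ceil(n * test_size)` rows to the test partition):

```python
import math

n_train = 6048                                 # rows in the main training set
n_subset = n_train - math.ceil(n_train * 0.5)  # tuning subset (test_size=0.5)
n_tune_test = math.ceil(n_subset * 0.3)        # 30% held out for tuning evaluation
n_tune_train = n_subset - n_tune_test
print(n_subset, n_tune_train, n_tune_test)     # → 3024 2116 908
```

These match the 3024 / 2116 / 908 record counts reported above.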
Hyperparameter Tuning Using K-fold CV¶
Because of computational limits and time constraints, I tune hyperparameters using only the subset (both its training and held-out portions) drawn from the train set.
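StratifiedKFold preserves class proportions within every fold. The core idea, as a minimal framework-free sketch (not scikit-learn's exact algorithm, which also supports shuffling), is to deal each class's indices round-robin across the folds:

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign sample indices to folds, preserving class balance by
    dealing each class's indices round-robin across the folds."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % n_splits].append(idx)
    return folds

labels = [0, 0, 0, 0, 1, 1, 1, 1]
for fold in stratified_folds(labels, 2):
    print(sorted(fold))  # each fold holds two 0s and two 1s
```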
# Clear backend
backend.clear_session()
# Fix the seed
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)
WARNING:tensorflow:From C:\Users\palad\anaconda3\Lib\site-packages\keras\src\backend\common\global_state.py:82: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.
# Step 1: Subset data (50%, ~3k records) from the original train data (x_train and y_train remain the same)
x_train_subset, _, y_train_subset, _ = train_test_split(x_train, y_train, test_size=0.5, stratify=y_train, random_state=1)
# Step 2: Further split this subset into a training set and test set for tuning
x_train_subset_from_train, x_test_subset_from_train, y_train_subset_from_train, y_test_subset_from_train = train_test_split(
x_train_subset, y_train_subset, test_size=0.3, stratify=y_train_subset, random_state=1)
# Convert to NumPy arrays if they're in DataFrame format
x_train_subset_from_train = x_train_subset_from_train.values
y_train_subset_from_train = y_train_subset_from_train.values
x_test_subset_from_train = x_test_subset_from_train.values
y_test_subset_from_train = y_test_subset_from_train.values
# Step 3: Define the model creation function
def create_model(n_layers, n_neurons, input_shape):
    model = Sequential()
    model.add(Dense(n_neurons, activation='relu', input_shape=(input_shape,)))
    for _ in range(1, n_layers):
        model.add(Dense(n_neurons, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))  # Assuming binary classification
    return model
# Step 4: Hyperparameter tuning using StratifiedKFold on the smaller train subset
def tune_model_with_cv(x_train_subset_from_train, y_train_subset_from_train, n_splits=5, epochs=200):
    results = []
    max_accuracy = 0
    optimal_layers = 0
    optimal_neurons = 0
    # Stratified K-Fold to preserve class balance
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1)
    # Limit neurons to the number of input features and double that
    for n_neurons in [x_train_subset_from_train.shape[1], x_train_subset_from_train.shape[1] * 2]:
        for n_layers in range(1, 3):  # Limit layers to 1 or 2
            fold_accuracies = []
            for train_idx, val_idx in skf.split(x_train_subset_from_train, y_train_subset_from_train):
                model = create_model(n_layers, n_neurons, input_shape=x_train_subset_from_train.shape[1])
                model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
                # Train the model
                model.fit(x_train_subset_from_train[train_idx], y_train_subset_from_train[train_idx], epochs=epochs, verbose=0)
                # Evaluate on the validation fold
                _, val_accuracy = model.evaluate(x_train_subset_from_train[val_idx], y_train_subset_from_train[val_idx], verbose=0)
                fold_accuracies.append(val_accuracy)
            avg_accuracy = np.mean(fold_accuracies)
            results.append((n_layers, n_neurons, avg_accuracy))
            if avg_accuracy > max_accuracy:
                max_accuracy = avg_accuracy
                optimal_layers = n_layers
                optimal_neurons = n_neurons
                print(f"New optimal found: Accuracy={max_accuracy:.4f}, Layers={n_layers}, Neurons={n_neurons}")
    return max_accuracy, optimal_layers, optimal_neurons, results
# Step 5: Evaluate on test subset from train data
def evaluate_on_test_subset(best_layers, best_neurons):
    model = create_model(best_layers, best_neurons, input_shape=x_train_subset_from_train.shape[1])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Train on the full train subset
    model.fit(x_train_subset_from_train, y_train_subset_from_train, epochs=200, verbose=1)
    # Evaluate on the held-out test subset drawn from the train data
    _, test_accuracy = model.evaluate(x_test_subset_from_train, y_test_subset_from_train, verbose=1)
    print(f"Accuracy on the Test Subset of Train Data: {test_accuracy:.4f}")
    return model
# StratifiedKFold must be in scope before the tuning function runs
from sklearn.model_selection import StratifiedKFold
# Running the tuning process
max_acc, layers, neurons, results = tune_model_with_cv(x_train_subset_from_train, y_train_subset_from_train)
New optimal found: Accuracy=0.5945, Layers=1, Neurons=22
# Evaluating on test subset of the train data
best_model = evaluate_on_test_subset(layers, neurons)
Epoch 1/200 67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 775us/step - accuracy: 0.5361 - loss: 0.7039
[... Epochs 2-199 truncated: training accuracy rises steadily from ~0.54 to ~0.69 as loss falls from 0.70 to 0.56 ...]
Epoch 200/200 67/67 ━━━━━━━━━━━━━━━━━━━━ 0s 674us/step - accuracy: 0.6936 - loss: 0.5571
29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 819us/step - accuracy: 0.6187 - loss: 0.6459
Accuracy on the Test Subset of Train Data: 0.6344
Optimal Model Fit¶
The best hyperparameters correspond to a simpler network with 1 hidden layer and 22 neurons, which achieved 59.5% average cross-validated accuracy on the train subset and 63.4% on the held-out test subset. These settings are now applied to the original train and test sets to evaluate the model.
Classification Metrics Using K-fold CV¶
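The scikit-learn metrics used below (accuracy, precision, recall, F1) all derive from the four confusion-matrix counts; as a framework-free sketch of what they compute:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from raw 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # zero_division=0 behavior
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# tp=2, fp=1, fn=1, tn=2 -> precision, recall and F1 are all 2/3 here
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

The `zero_division=0` guard mirrors the option passed to `precision_score` in the code below.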
# Define function for cross-validation and evaluation on Neural Network
def cross_validate_and_evaluate_nn(best_layers, best_neurons, n_splits=5, epochs=200):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1)
    # Initialize lists to store metrics for cross-validation on the train set
    fold_accuracy_train, fold_precision_train, fold_recall_train, fold_f1_train = [], [], [], []
    for train_idx, val_idx in skf.split(x_train, y_train):
        # Create a new model for each fold
        model = create_model(best_layers, best_neurons, input_shape=x_train.shape[1])
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        # Train on the training fold
        model.fit(x_train.iloc[train_idx], y_train.iloc[train_idx], epochs=epochs, verbose=0)
        # Evaluate on the validation fold
        y_val_pred = (model.predict(x_train.iloc[val_idx]) > 0.5).astype("int32")
        # Calculate metrics for this fold
        fold_accuracy_train.append(accuracy_score(y_train.iloc[val_idx], y_val_pred))
        fold_precision_train.append(precision_score(y_train.iloc[val_idx], y_val_pred, zero_division=0))
        fold_recall_train.append(recall_score(y_train.iloc[val_idx], y_val_pred))
        fold_f1_train.append(f1_score(y_train.iloc[val_idx], y_val_pred))
    # Average cross-validated metrics on the train set
    avg_cv_accuracy_train = np.mean(fold_accuracy_train)
    avg_cv_precision_train = np.mean(fold_precision_train)
    avg_cv_recall_train = np.mean(fold_recall_train)
    avg_cv_f1_train = np.mean(fold_f1_train)
    # Train the final model on the full training set
    final_nn_model = create_model(best_layers, best_neurons, input_shape=x_train.shape[1])
    final_nn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    final_nn_model.fit(x_train, y_train, epochs=epochs, verbose=1)
    # Evaluate on the test set
    y_test_pred = (final_nn_model.predict(x_test) > 0.5).astype("int32")
    # Calculate test set metrics
    accuracy_test = accuracy_score(y_test, y_test_pred)
    precision_test = precision_score(y_test, y_test_pred, zero_division=0)
    recall_test = recall_score(y_test, y_test_pred)
    f1_test = f1_score(y_test, y_test_pred)
    # Build a DataFrame comparing train (CV) and test metrics
    nn_metrics_df = pd.DataFrame({
        'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
        'NN Train': [avg_cv_accuracy_train, avg_cv_precision_train, avg_cv_recall_train, avg_cv_f1_train],
        'NN Test': [accuracy_test, precision_test, recall_test, f1_test]
    })
    # Return both the model and the metrics DataFrame
    return final_nn_model, nn_metrics_df
# Run the function and store the metrics DataFrame
final_nn_model, nn_metrics_df = cross_validate_and_evaluate_nn(layers, neurons)
38/38 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step (×5, one predict call per validation fold) Epoch 1/200 189/189 ━━━━━━━━━━━━━━━━━━━━ 1s 617us/step - accuracy: 0.5695 - loss: 0.6850 [... per-epoch log for the final fit on the full training set truncated; accuracy and loss improve steadily ...] Epoch 200/200 189/189 ━━━━━━━━━━━━━━━━━━━━ 0s 763us/step - accuracy: 0.6629 - loss: 0.5798 48/48 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
# Now display the DataFrame with the performance metrics
print("Performance Metrics of Neural Network:")
nn_metrics_df
Performance Metrics of Neural Network:
| | Metric | NN Train | NN Test |
|---|---|---|---|
| 0 | Accuracy | 0.6156 | 0.5856 |
| 1 | Precision | 0.6435 | 0.6320 |
| 2 | Recall | 0.7834 | 0.7148 |
| 3 | F1 Score | 0.7064 | 0.6709 |
The model achieved a cross-validated accuracy of 61.56% on the full training set and a final test accuracy of about 58.6%, a considerable drop from train to test. Further fine-tuning with regularization techniques, or optimizing hyperparameters on the full dataset, may slightly improve the results, but overall the model's performance is acceptable given the time-efficient tuning strategy. Early stopping could also be included, as the loss and accuracy started to stagnate partway through training.
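Keras supports this directly via the `EarlyStopping` callback (e.g. `EarlyStopping(monitor='loss', patience=10, restore_best_weights=True)` passed to `model.fit(..., callbacks=[...])`). The underlying logic is just a patience counter over the loss history; a minimal framework-free sketch (the function name and defaults here are illustrative, not a Keras API):

```python
def should_stop(losses, patience=10, min_delta=1e-4):
    """Return True once the loss has failed to improve by at least
    `min_delta` for `patience` consecutive epochs."""
    best, wait = float("inf"), 0
    for loss in losses:
        if loss < best - min_delta:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1             # no improvement: count toward patience
            if wait >= patience:
                return True
    return False
```

With this logic, the run above would likely have halted well before epoch 200, since the loss moved by less than 0.001 per epoch over the final stretch.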
Confusion Matrix¶
def confusion_matrix_with_counts_and_percentage_keras(model, predictors, target, threshold=0.5):
    """
    Function to compute and plot the confusion matrix for a Keras classification model with both counts and percentages.
    model: Keras classifier model
    predictors: independent variables (features)
    target: dependent variable (actual labels)
    threshold: threshold for classifying the observation as class 1
    """
    # Get the predictions
    pred_prob = model.predict(predictors)
    # Convert probabilities to class labels based on the threshold
    pred = np.where(pred_prob > threshold, 1, 0)
    # Compute confusion matrix
    cm = confusion_matrix(target, pred)
    # Compute percentages
    cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
    # Create an annotation matrix with counts and percentages
    annot = np.empty_like(cm).astype(str)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            annot[i, j] = f'{cm[i, j]}\n{cm_percent[i, j]:.2f}%'
    # Plot the confusion matrix with annotations for both counts and percentages
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=annot, fmt='', cmap='Blues', cbar=False,
                xticklabels=[0, 1], yticklabels=[0, 1])
    plt.title('Confusion Matrix with Counts and Percentages')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    return cm

# Call the modified function for Keras models to display the confusion matrix
confusion_matrix_with_counts_and_percentage_keras(final_nn_model, x_test, y_test)
48/48 ━━━━━━━━━━━━━━━━━━━━ 0s 680us/step
array([[247, 372],
[255, 639]], dtype=int64)
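The headline metrics in the table above can be recovered directly from this matrix. With rows as actual labels and columns as predictions, a quick check using the counts returned above:

```python
import numpy as np

# Confusion matrix returned above: rows = actual (0, 1), columns = predicted (0, 1).
cm = np.array([[247, 372],
               [255, 639]])
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()   # (639 + 247) / 1513
precision = tp / (tp + fp)        # 639 / 1011
recall = tp / (tp + fn)           # 639 / 894
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.5856 0.632 0.7148 0.6709, matching the "NN Test" column above
```

The matrix also shows where the errors sit: 372 false positives against 255 false negatives, i.e. the model leans toward predicting class 1, which is why recall (71.5%) is much higher than precision (63.2%).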
Feature/Permutation Importance¶
# Step 1: Define a scoring function to evaluate the model's accuracy
def model_score(final_nn_model, x, y):
    pred_prob = final_nn_model.predict(x)
    pred = np.where(pred_prob > 0.5, 1, 0)
    return accuracy_score(y, pred)

# Step 2: Calculate permutation importance
perm_importance = permutation_importance(final_nn_model, x_test, y_test, n_repeats=10, scoring=model_score, random_state=1)
# Step 3: Extract feature importance and plot
sorted_idx = perm_importance.importances_mean.argsort()
plt.figure(figsize=(10, 8))
plt.barh(x_test.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
plt.ylabel("Features")
plt.title("Permutation Feature Importance for Neural Network")
plt.show()
# Step 4: Create a DataFrame for feature importance
nn_feature_importance_df = pd.DataFrame({
    'Feature': x_test.columns,
    'Importance': perm_importance.importances_mean
}).sort_values(by='Importance', ascending=False).reset_index(drop=True)
# Display the feature importance DataFrame
nn_feature_importance_df
| | Feature | Importance |
|---|---|---|
| 0 | CUMSALES | 0.0353 |
| 1 | LOYALTY_GROUP_Loyal | 0.0329 |
| 2 | MOSTUSED_PLATFORM_Web | 0.0223 |
| 3 | LOYALTY_GROUP_Ocasional | 0.0136 |
| 4 | FREQUENCY | 0.0109 |
| 5 | MARITAL_STATUS_Single | 0.0100 |
| 6 | AVERAGE_TICKET | 0.0089 |
| 7 | LOYALTY_GROUP_Vip | 0.0066 |
| 8 | LOYALTY_GROUP_Split | 0.0050 |
| 9 | PRICE_GROUP_Moderately Price Insensitive | 0.0046 |
| 10 | PRICE_GROUP_Selective Price Sensitive | 0.0044 |
| 11 | CONSISTENCY | 0.0039 |
| 12 | MOSTUSED_PLATFORM_Mobile | 0.0036 |
| 13 | GENDER_Male | 0.0020 |
| 14 | MARITAL_STATUS_Divorced | 0.0012 |
| 15 | PRICE_GROUP_Very Price Insensitive | -0.0007 |
| 16 | PRICE_GROUP_Very Price Sensitive | -0.0015 |
| 17 | AGE | -0.0020 |
| 18 | GENDER_Female | -0.0023 |
| 19 | MARITAL_STATUS_Married | -0.0034 |
| 20 | PRICE_GROUP_Moderately Price Sensitive | -0.0038 |
| 21 | RECENCY | -0.0095 |
# Function to plot a confusion matrix with TP, FP, TN, FN labels and row-wise percentages
def plot_confusion_matrix_with_labels(ax, model, X, y_true, model_name, is_nn=False):
    # For NN models, convert probabilities to class predictions
    if is_nn:
        y_pred_prob = model.predict(X)
        y_pred = (y_pred_prob > 0.5).astype(int)  # Threshold for binary classification
    else:
        y_pred = model.predict(X)
    cm = confusion_matrix(y_true, y_pred)
    # Normalize the confusion matrix row-wise (i.e., per actual class)
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    # Calculate TP, TN, FP, FN
    tn, fp, fn, tp = cm.ravel()
    # Create annotation labels with counts and row-wise percentages
    annot = [[f'TN={tn}\n{cm_normalized[0, 0]:.2%}', f'FP={fp}\n{cm_normalized[0, 1]:.2%}'],
             [f'FN={fn}\n{cm_normalized[1, 0]:.2%}', f'TP={tp}\n{cm_normalized[1, 1]:.2%}']]
    # Plot confusion matrix in the provided axes object
    sns.heatmap(cm, annot=annot, fmt='', cmap='Blues', cbar=False, ax=ax, annot_kws={"size": 12})
    ax.set_title(f'{model_name}', fontsize=14)
    ax.set_ylabel('Actual', fontsize=12)
    ax.set_xlabel('Predicted', fontsize=12)
# Create a 2x3 grid for confusion matrices
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
# Plot each model's confusion matrix in a subplot
# Lasso
plot_confusion_matrix_with_labels(axes[0, 0], final_logreg_lasso, x_test, y_test, "Lasso")
# Random Forest
plot_confusion_matrix_with_labels(axes[0, 1], optim_rf_classifier, x_test, y_test, "Random Forest")
# Gradient Boosting
plot_confusion_matrix_with_labels(axes[0, 2], optim_gb_classifier, x_test, y_test, "Gradient Boosting")
# SVM
plot_confusion_matrix_with_labels(axes[1, 0], final_svm_poly, x_test, y_test, "SVM")
# Neural Network (set is_nn=True to handle NN predictions)
plot_confusion_matrix_with_labels(axes[1, 1], final_nn_model, x_test, y_test, "Neural Network", is_nn=True)
# Hide the unused sixth subplot
axes[1, 2].axis('off')
# Adjust layout for clarity
plt.tight_layout()
plt.show()
# Print the class assignment for SEGMENT_1
print(f"Class mapping for 'SEGMENT_1': {le.classes_[0]} -> 0, {le.classes_[1]} -> 1")
Class mapping for 'SEGMENT_1': Core -> 0, Up -> 1
While the True Positive (TP) rates are impressively high across all models, especially for Gradient Boosting at 98.10%, the True Negative (TN) rates are notably lower, indicating a significant number of False Positives (FP). This imbalance suggests that while the models are highly sensitive, they may be over-predicting the positive class. Adjusting the classification cut-off thresholds could help better balance the trade-off between TP and TN, potentially reducing the high FP rate and improving overall model performance.
5.2 Final Classification Metrics¶
# Function to compute all the necessary metrics from the confusion matrix
def compute_metrics(y_true, y_pred):
    # Calculate confusion matrix values
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    # Calculate rates
    accuracy = accuracy_score(y_true, y_pred)
    tp_rate = recall_score(y_true, y_pred)  # Recall is the TP Rate
    tn_rate = tn / (tn + fp)  # TN Rate (Specificity)
    fp_rate = fp / (fp + tn)  # FP Rate
    fn_rate = fn / (fn + tp)  # FN Rate
    precision = precision_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    # Return the metrics in a dictionary
    return {
        'Accuracy': accuracy,
        'TP Rate (Recall)': tp_rate,
        'TN Rate (Specificity)': tn_rate,
        'FP Rate': fp_rate,
        'FN Rate': fn_rate,
        'Precision': precision,
        'Recall (TP Rate)': tp_rate,  # Repeated for clarity
        'F1-Score': f1
    }

# Create a function to get predictions and compute metrics for each model
def get_metrics_for_model(model, X_test, y_test, model_name, is_nn=False):
    if is_nn:
        y_pred_prob = model.predict(X_test)
        y_pred = (y_pred_prob > 0.5).astype(int)  # Convert probabilities to binary predictions
    else:
        y_pred = model.predict(X_test)
    metrics = compute_metrics(y_test, y_pred)
    return model_name, metrics
# Collect metrics for each model
metrics_data = {}
metrics_data['Lasso'] = get_metrics_for_model(final_logreg_lasso, x_test, y_test, "Lasso")[1]
metrics_data['Random Forest'] = get_metrics_for_model(optim_rf_classifier, x_test, y_test, "Random Forest")[1]
metrics_data['Gradient Boosting'] = get_metrics_for_model(optim_gb_classifier, x_test, y_test, "Gradient Boosting")[1]
metrics_data['SVM'] = get_metrics_for_model(final_svm_poly, x_test, y_test, "SVM")[1]
metrics_data['Neural Network'] = get_metrics_for_model(final_nn_model, x_test, y_test, "Neural Network", is_nn=True)[1]
# Convert the dictionary to a DataFrame for better visualization
metrics_df = pd.DataFrame(metrics_data).T
# Display the DataFrame
metrics_df
| Model | Accuracy | TP Rate (Recall) | TN Rate (Specificity) | FP Rate | FN Rate | Precision | Recall (TP Rate) | F1-Score |
|---|---|---|---|---|---|---|---|---|
| Lasso | 0.6067 | 0.9284 | 0.1422 | 0.8578 | 0.0716 | 0.6098 | 0.9284 | 0.7361 |
| Random Forest | 0.6385 | 0.9407 | 0.2019 | 0.7981 | 0.0593 | 0.6300 | 0.9407 | 0.7546 |
| Gradient Boosting | 0.6345 | 0.9810 | 0.1341 | 0.8659 | 0.0190 | 0.6207 | 0.9810 | 0.7603 |
| SVM | 0.6319 | 0.9374 | 0.1906 | 0.8094 | 0.0626 | 0.6258 | 0.9374 | 0.7506 |
| Neural Network | 0.5856 | 0.7148 | 0.3990 | 0.6010 | 0.2852 | 0.6320 | 0.7148 | 0.6709 |
What is the Business Goal here?¶
Assumption:
With no clear documentation available for the dataset variables, I referred to common customer segmentation practices outlined in (The Good-Better-Best Approach to Pricing, 2018). Based on this model, it’s reasonable to assume that the 'Core' segment represents regular customers who prefer basic products, while the 'Up' segment includes premium customers who opt for higher-end or better-value offerings. 'Up' customers often become the focus of loyalty programs, personalized marketing, and upselling strategies due to their higher lifetime value. Understanding the distinction between 'Core' and 'Up' helps businesses refine strategies tailored to each customer segment.
Aligning with these common industry practices, here is the likely business goal of building this classification model:
For this project, the focus is on correctly identifying as many 'Up' customers as possible, since they represent a higher-value segment with more potential for revenue through personalized marketing, loyalty programs, and upselling. It is tolerable to misclassify 'Core' customers as 'Up', as the business impact of sending marketing or upselling efforts to the 'Core' group is relatively low compared to losing an 'Up' customer by classifying them as 'Core' (which could mean missed revenue opportunities).
Thus, in this context:
1). Maximizing Recall (i.e., capturing as many actual 'Up' customers, the True Positives, as possible) is the top priority.
2). Misclassifying 'Core' customers as 'Up' (False Positives) is acceptable/tolerable.
3). A good balance in F1-Score and a lower False Negative Rate are beneficial.
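These priorities can be encoded in a single selection metric. Below is a minimal sketch (not from the original notebook; the label arrays are toy data) using scikit-learn's F-beta score, where beta greater than 1 weights Recall more heavily than Precision, matching the recall-first goal:

```python
# Hedged sketch: encode the recall-first business goal as an F-beta
# score with beta=2, which weights recall higher than precision.
# The y_true / y_pred arrays below are toy data, not project output.
import numpy as np
from sklearn.metrics import fbeta_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 0])  # 1 = 'Up', 0 = 'Core'
y_pred = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

rec = recall_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)  # beta=2: recall counts ~4x precision
print(f"Recall: {rec:.2f}, F2-score: {f2:.2f}")
```

A scorer built with `make_scorer(fbeta_score, beta=2)` could then drive cross-validated model selection instead of plain accuracy.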
What is the Use of this Classification Model?¶
Considering the business goal outlined above, here is how this e-commerce classification model can improve the business:
For this e-commerce dataset, where 0 is 'Core' (40%) and 1 is 'Up' (60%), the most critical metrics are Recall (TP Rate) and F1-Score. These are vital because the business likely cares more about correctly identifying as many 'Up' customers (majority class) as possible. In the best-performing model, Gradient Boosting, Recall stands at 98.10%, indicating that the model captures most of the 'Up' customers, minimizing the risk of missing revenue opportunities.
Precision (how many predicted 'Up' customers are correct) is also important, but the cost of sending marketing material to misclassified 'Core' customers is likely lower than the cost of missing 'Up' customers. For Gradient Boosting, Precision is 62.07%, showing that some precision has been traded away for higher Recall. Thus, Recall takes precedence over Precision.
The False Negative Rate (FNR), which is the rate of misclassifying 'Up' customers as 'Core', is crucial as well. Gradient Boosting's FNR is 1.90%, indicating the model is very effective at minimizing the loss of potential high-value customers.
F1-Score, as a balance between Precision and Recall, provides a comprehensive view of model performance. Gradient Boosting has the highest F1-Score at 76.03%, making it a highly valuable metric in this context.
On the other hand, Accuracy alone is not very reliable due to the class imbalance (60-40 split). A model could score well simply by favoring the majority class ('Up'): a naive classifier that labels every customer 'Up' would already reach roughly 60% accuracy while overlooking 'Core' customers entirely. The best model, Gradient Boosting, has an accuracy of 63.45%, which doesn't capture the full story in the presence of imbalance.
In summary, the focus should be on Recall (98.10%), F1-Score (76.03%), and False Negative Rate (1.90%) for the best model, Gradient Boosting, as these metrics align with the goal of identifying as many 'Up' customers as possible while minimizing lost opportunities due to misclassification.
Which is the Best Performing Model?¶
Gradient Boosting (GB) stands out as the best-performing model for this dataset, with higher Recall and F1-Score compared to Logistic Regression, Random Forest, and SVM. This means GB is more effective at correctly identifying 'Up' customers while maintaining a strong balance between Precision and Recall. While Random Forest also performs well, GB's ability to handle the class imbalance and provide more accurate predictions makes it the optimal choice for this segmentation task.
This Model predicts TP well, what about TN?¶
However, if True Negatives (TN)—that is, correctly identifying 'Core' customers—are equally important, given that 'Core' represents a significant 40% of the customer base, adjusting the probability threshold for classification could help balance the trade-off between True Positives (Up customers) and True Negatives (Core customers). By altering the default threshold from 50% to a lower or higher value, the business can better tune the model to either prioritize Core or Up customers based on specific business goals, such as customer retention or upselling.
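A minimal sketch of that threshold adjustment, with illustrative probabilities standing in for the fitted models' predicted scores (this is not output from the project models):

```python
# Hedged sketch: moving the cut-off from 0.5 to 0.6 trades TP rate
# ('Up' recall) for TN rate ('Core' specificity). Toy probabilities.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_prob = np.array([0.35, 0.55, 0.60, 0.45, 0.52, 0.70, 0.80, 0.65, 0.58, 0.90])

rates = {}
for threshold in (0.5, 0.6):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    rates[threshold] = (tp / (tp + fn), tn / (tn + fp))
    print(f"threshold={threshold}: TP rate={rates[threshold][0]:.2f}, "
          f"TN rate={rates[threshold][1]:.2f}")
```

Raising the threshold here lifts the TN rate at the cost of some TP rate; sweeping the same loop over `predict_proba` outputs would let the business pick its preferred operating point.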
Can this Model Align with a Different Business Goal?¶
By the same logic, if the business goal is different, the model that performs best on the related metrics can be chosen. For example, if the goal is to minimize False Positives (FP), misclassifying 'Core' customers as 'Up', then Random Forest (FP rate 0.7981) or the Neural Network (FP rate 0.6010, the lowest in the table) may be better options, with Random Forest additionally handling non-linear relationships more effectively than Logistic Regression. However, if the focus is maximizing F1-Score for a balance between precision and recall, Gradient Boosting remains the optimal choice. In the end, it is always a trade-off, in Data Science as in business.
5.3 Final Feature Importance¶
Visual¶
# Random Forest: RF_coefficients_df
# Lasso: lasso_coefficients_df
# Gradient Boosting: GB_coefficients_df
# SVM: perm_importances (for permutation importance)
# Neural Network: nn_feature_importance_df (calculated using permutation importance)
# Reindex all feature importance DataFrames to align with Random Forest order
feature_order = RF_coefficients_df['Feature']
# Now ensure all other models follow the same feature order
lasso_coefficients_df = lasso_coefficients_df.set_index('Feature').reindex(feature_order).reset_index()
GB_coefficients_df = GB_coefficients_df.set_index('Feature').reindex(feature_order).reset_index()
perm_importances = perm_importances.set_index('Feature').reindex(feature_order).reset_index() # SVM permutation importance
nn_feature_importance_df = nn_feature_importance_df.set_index('Feature').reindex(feature_order).reset_index()
# Plot feature importance for all models, including the Neural Network
fig, axs = plt.subplots(3, 2, figsize=(15, 18)) # Adjusting layout to fit NN
# Random Forest (Top Left)
axs[0, 0].barh(RF_coefficients_df['Feature'], RF_coefficients_df['Importance'], color='skyblue')
axs[0, 0].set_title('Random Forest Feature Importance')
axs[0, 0].set_xlabel('Importance')
axs[0, 0].set_ylabel('Features')
# Lasso (Top Right)
axs[0, 1].barh(lasso_coefficients_df['Feature'], lasso_coefficients_df['Coefficient'], color='lightgreen')
axs[0, 1].set_title('Lasso Feature Importance')
axs[0, 1].set_xlabel('Coefficient')
axs[0, 1].set_yticks([])
# Gradient Boosting (Middle Left)
axs[1, 0].barh(GB_coefficients_df['Feature'], GB_coefficients_df['Importance'], color='coral')
axs[1, 0].set_title('Gradient Boosting Feature Importance')
axs[1, 0].set_xlabel('Importance')
axs[1, 0].set_ylabel('Features')
# SVM (Middle Right) - using permutation importance
axs[1, 1].barh(perm_importances['Feature'], perm_importances['Importance'], color='lightcoral')
axs[1, 1].set_title('SVM (Permutation Importance)')
axs[1, 1].set_xlabel('Importance')
axs[1, 1].set_yticks([])
# Neural Network (Bottom Left)
axs[2, 0].barh(nn_feature_importance_df['Feature'], nn_feature_importance_df['Importance'], color='lightblue')
axs[2, 0].set_title('Neural Network Feature Importance')
axs[2, 0].set_xlabel('Permutation Importance')
axs[2, 0].set_ylabel('Features')
# Adjust layout for clarity
plt.tight_layout()
plt.show()
Predictors such as cumulative sales, consistency, average ticket, and loyalty group stood out as most important, whereas marital status, price group, and age were the least important across all models.
Ranking Table¶
# Reindex all feature importance DataFrames to align with Gradient Boosting order
feature_order = GB_coefficients_df['Feature']
# Now ensure all other models follow the same feature order
RF_coefficients_df = RF_coefficients_df.set_index('Feature').reindex(feature_order).reset_index()
lasso_coefficients_df = lasso_coefficients_df.set_index('Feature').reindex(feature_order).reset_index()
perm_importances = perm_importances.set_index('Feature').reindex(feature_order).reset_index() # SVM permutation importance
nn_feature_importance_df = nn_feature_importance_df.set_index('Feature').reindex(feature_order).reset_index()
# Create a DataFrame to aggregate ranks
rank_df = pd.DataFrame({
    'Feature': feature_order,
    'Gradient Boosting Rank': GB_coefficients_df['Importance'].rank(ascending=False),
    'Random Forest Rank': RF_coefficients_df['Importance'].rank(ascending=False),
    'Lasso Rank': lasso_coefficients_df['Coefficient'].rank(ascending=False),
    'SVM Rank': perm_importances['Importance'].rank(ascending=False),
    'Neural Network Rank': nn_feature_importance_df['Importance'].rank(ascending=False)
})
# Calculate the aggregate rank (mean rank across all models) excluding the 'Feature' column
rank_df['Aggregate Rank (Mean)'] = rank_df[['Gradient Boosting Rank', 'Random Forest Rank', 'Lasso Rank', 'SVM Rank', 'Neural Network Rank']].mean(axis=1)
# Sort by the aggregate rank and reset index starting from 1
rank_df = rank_df.sort_values(by='Aggregate Rank (Mean)').reset_index(drop=True)
# Set the index starting from 1 for ranking
rank_df.index = rank_df.index + 1
# Display the final ranking DataFrame with index starting from 1
rank_df
| Rank | Feature | Gradient Boosting Rank | Random Forest Rank | Lasso Rank | SVM Rank | Neural Network Rank | Aggregate Rank (Mean) |
|---|---|---|---|---|---|---|---|
| 1 | LOYALTY_GROUP_Loyal | 9.0000 | 6.0000 | 2.0000 | 5.0000 | 2.0000 | 4.8000 |
| 2 | CUMSALES | 1.0000 | 1.0000 | 21.0000 | 2.0000 | 1.0000 | 5.2000 |
| 3 | AVERAGE_TICKET | 7.0000 | 2.0000 | 4.0000 | 7.0000 | 7.0000 | 5.4000 |
| 4 | CONSISTENCY | 4.0000 | 3.0000 | 12.5000 | 3.0000 | 12.0000 | 6.9000 |
| 5 | FREQUENCY | 5.0000 | 8.0000 | 12.5000 | 4.0000 | 5.0000 | 6.9000 |
| 6 | LOYALTY_GROUP_Ocasional | 3.0000 | 10.0000 | 17.0000 | 1.0000 | 4.0000 | 7.0000 |
| 7 | LOYALTY_GROUP_Vip | 10.0000 | 15.0000 | 1.0000 | 8.0000 | 8.0000 | 8.4000 |
| 8 | LOYALTY_GROUP_Split | 2.0000 | 4.0000 | 22.0000 | 6.0000 | 9.0000 | 8.6000 |
| 9 | MOSTUSED_PLATFORM_Web | 8.0000 | 11.0000 | 3.0000 | 22.0000 | 3.0000 | 9.4000 |
| 10 | AGE | 6.0000 | 5.0000 | 8.0000 | 11.0000 | 18.0000 | 9.6000 |
| 11 | MOSTUSED_PLATFORM_Mobile | 11.0000 | 9.0000 | 12.5000 | 14.0000 | 13.0000 | 11.9000 |
| 12 | GENDER_Female | 14.0000 | 14.0000 | 6.0000 | 9.0000 | 19.0000 | 12.4000 |
| 13 | GENDER_Male | 13.0000 | 13.0000 | 12.5000 | 13.0000 | 14.0000 | 13.1000 |
| 14 | MARITAL_STATUS_Single | 21.0000 | 16.0000 | 12.5000 | 10.0000 | 6.0000 | 13.1000 |
| 15 | PRICE_GROUP_Moderately Price Insensitive | 17.0000 | 17.0000 | 5.0000 | 19.0000 | 10.0000 | 13.6000 |
| 16 | RECENCY | 12.0000 | 7.0000 | 18.0000 | 12.0000 | 22.0000 | 14.2000 |
| 17 | PRICE_GROUP_Selective Price Sensitive | 16.0000 | 20.0000 | 7.0000 | 17.0000 | 11.0000 | 14.2000 |
| 18 | PRICE_GROUP_Very Price Insensitive | 15.0000 | 18.0000 | 20.0000 | 15.0000 | 16.0000 | 16.8000 |
| 19 | MARITAL_STATUS_Married | 19.0000 | 12.0000 | 19.0000 | 16.0000 | 20.0000 | 17.2000 |
| 20 | PRICE_GROUP_Very Price Sensitive | 21.0000 | 19.0000 | 12.5000 | 18.0000 | 17.0000 | 17.5000 |
| 21 | MARITAL_STATUS_Divorced | 21.0000 | 22.0000 | 12.5000 | 21.0000 | 15.0000 | 18.3000 |
| 22 | PRICE_GROUP_Moderately Price Sensitive | 18.0000 | 21.0000 | 12.5000 | 20.0000 | 21.0000 | 18.5000 |
Why Feature Importance Ranking on 'Mean' is better than Sum?¶
In this case, the mean rank is more appropriate because it balances feature importance across both simple linear models like Lasso and more complex, non-linear models like Gradient Boosting and Random Forest. The mean keeps the aggregate on the same 1-22 scale as the individual model rankings, so it stays directly interpretable, and unlike a raw sum it would remain comparable even if one model's ranking were missing or heavily tied. Since the dataset shows non-linear relationships, the mean rank provides a fair representation of each feature's overall significance across different types of models.
6- Conclusions¶
Parametric vs Non-Parametric Models: Which are Better?¶
Given the clear non-linear relationships between the outcome and predictors that we saw from bivariate plots, non-parametric models like SVM (with non-linear kernels), Random Forests, Gradient Boosting, and Neural Networks outperform Logistic Regression and Lasso, as these capture complex relationships better by handling the variance more effectively. Logistic Regression and Lasso are biased towards simpler, linear relationships and may struggle with underfitting due to their inability to capture non-linear patterns effectively.
Is Building a More Interpretable Model with Higher Performance Possible?¶
Across all models, predictors like Average_ticket (which is essentially capturing the same information as cumulative sales), Loyalty_group, Consistency, Cumulative_sales, and Frequency consistently stood out as the most important features. On the other hand, Marital_status and Price_group were consistently ranked as the least important across all models.
Interestingly, the Chi-square test and correlations from the bivariate analysis indicated that marital_status and age were not significant and had very low correlations with the target variable, segment_1. However, the final feature importance rankings gave Age a middling (average) rank across all models, while Price_Group, which initially appeared significant with a high correlation, turned out to be less important across all models. This suggests possible multicollinearity among the features, where similar features such as loyalty group may have overshadowed price group.
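One way to test the multicollinearity suspicion is a pairwise correlation scan over the predictors. This is a hedged sketch on simulated columns (in practice the real one-hot encoded feature matrix would be used):

```python
# Hedged sketch: flag highly correlated predictor pairs as
# multicollinearity suspects. Columns are simulated, not project data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
loyalty = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "LOYALTY_GROUP_Loyal": loyalty,
    # Simulated price-group flag that tracks loyalty ~90% of the time
    "PRICE_GROUP_Very Price Insensitive": np.where(rng.random(200) < 0.9, loyalty, 1 - loyalty),
    "AGE": rng.integers(18, 75, size=200),
})

corr = df.corr().abs()
corr = corr.mask(np.eye(len(corr), dtype=bool))  # drop self-correlation
pairs = corr.stack()
suspects = pairs[pairs > 0.7].index.tolist()  # conventional |r| > 0.7 cut-off
print(suspects)
```

Pairs flagged this way are candidates for dropping or merging before refitting the models.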
PCA is worth exploring, and Cluster Analysis is highly recommended, to identify features that group together. A more interpretable (if not more efficient) model can then be built in the future by dropping the less important or redundant features and focusing on those with higher predictive power.
PCA & Clustering Proposal¶
Here are the comments on PCA and clustering and how incorporating it in the project would result in enhanced models.
PCA: Conducting PCA on the predictor variables makes sense as it allows us to reduce dimensionality by retaining 4 principal components that explain 98% of the total variance, while dropping 2 components that account for less than 2%. Although this reduces interpretability, the goal is to improve predictive performance rather than focus on individual feature contributions. By applying PCA, we streamline the dataset and potentially improve model efficiency without significant information loss.
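The PCA step could look like the sketch below, with synthetic data standing in for the project's predictor matrix; the fractional `n_components` asks scikit-learn for the smallest number of components reaching the variance target:

```python
# Hedged sketch of the proposed PCA reduction, run on synthetic data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))
X[:, 4] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=500)  # redundant column

X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive
pca = PCA(n_components=0.98)  # keep components until 98% variance explained
X_reduced = pca.fit_transform(X_scaled)
print(f"Components kept: {pca.n_components_} of {X.shape[1]}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")
```

The redundant column lets PCA drop at least one component while staying above the 98% target, mirroring the 4-of-6 reduction proposed above.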
Clustering: Given the feature importance findings, clustering can help address overlaps between features like loyalty group and price group, which may exhibit multicollinearity. By grouping customers into clusters based on patterns across all features, we can capture underlying relationships not easily distinguished in feature importance. These clusters could serve as new, meaningful features in the model, highlighting segments of customers where certain features are more relevant, improving model accuracy. Additionally, with all our models heavily reliant on categorical features, clustering can help simplify the model by reducing the impact of categorical features with many classes, minimizing unnecessary complexity while maintaining performance.
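The clustering idea can be sketched as follows, again on synthetic stand-in data; the resulting cluster label would be appended as a new categorical predictor:

```python
# Hedged sketch: derive a customer cluster label and attach it as a
# new feature. Column names echo the dataset, values are simulated.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CUMSALES": rng.gamma(2.0, 500.0, size=300),
    "FREQUENCY": rng.poisson(12, size=300),
    "AVERAGE_TICKET": rng.normal(80, 15, size=300),
})

X_scaled = StandardScaler().fit_transform(df)  # K-means is scale-sensitive
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["CLUSTER"] = kmeans.fit_predict(X_scaled)  # new categorical feature
print(df["CLUSTER"].value_counts())
```

In practice the number of clusters would be chosen with an elbow or silhouette analysis rather than fixed at 3.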
Overall, it is better to conduct PCA, and cluster analysis is highly recommended for our modeling needs.
What is the Ranking of these Models: Performance Perspective?¶
1). Gradient Boosting offers the best performance with Recall (0.9810) and F1-Score (0.7603). It’s ideal for identifying true positives but has a false positive rate of 0.8659, which could be improved.
2). Random Forest balances well between Precision (0.6300) and Recall (0.9407), with a slightly lower F1-Score (0.7546) than Gradient Boosting, making it a fast, reliable option.
3). SVM shows moderate results with Precision (0.6258) and Recall (0.9374), but its F1-Score (0.7506) suggests it lags slightly behind the top two models.
4). Lasso is the simplest but performs weaker, with a Recall of 0.9284 and F1-Score of 0.7361, making it less suited for non-linear relationships.
5). Neural Networks have a lower F1-Score (0.6709) and the highest false negative rate (0.2852), indicating the need for tuning or feature engineering.
In summary, Gradient Boosting is the top performer, while Random Forest is a solid, faster alternative.
What is the Ranking of these Models: Business Perspective?¶
# Creating the consolidated metrics DataFrame for all models
# Concatenating all DataFrames to form a single table
consolidated_metrics_df = pd.concat([lasso_metrics_df.set_index('Metric'),
                                     rf_metrics_df.set_index('Metric'),
                                     gb_metrics_df.set_index('Metric'),
                                     svm_metrics_df.set_index('Metric'),
                                     nn_metrics_df.set_index('Metric')],
                                    axis=1)
# Display the consolidated DataFrame
print("Consolidated Performance Metrics for All Models:")
consolidated_metrics_df
Consolidated Performance Metrics for All Models:
| Metric | Lasso Train | Lasso Test | RF Train | RF Test | GB Train | GB Test | SVM Train | SVM Test | NN Train | NN Test |
|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.6171 | 0.6067 | 0.6262 | 0.6385 | 0.6379 | 0.6345 | 0.6305 | 0.6319 | 0.6156 | 0.5856 |
| Precision | 0.6148 | 0.6098 | 0.6241 | 0.6300 | 0.6233 | 0.6207 | 0.6241 | 0.6258 | 0.6435 | 0.6320 |
| Recall | 0.9421 | 0.9284 | 0.9239 | 0.9407 | 0.9787 | 0.9810 | 0.9421 | 0.9374 | 0.7834 | 0.7148 |
| F1 Score | 0.7440 | 0.7361 | 0.7449 | 0.7546 | 0.7615 | 0.7603 | 0.7508 | 0.7506 | 0.7064 | 0.6709 |
Given the context of the dataset, where we’re dealing with non-linear relationships and slight class imbalance, here is how a model choice can be made based on the situation and necessity:
1). Best Predictive Performance (Testing Set Alone):
Gradient Boosting emerges as the top choice, with the highest recall (0.9810) and F1-score (0.7603) on the test set. This makes it highly suited for capturing the complex, non-linear relationships present in our dataset. Since the goal is maximizing predictive performance, Gradient Boosting’s ability to handle class imbalance and offer strong performance without heavy overfitting is crucial.
2). Difference Between Training and Testing Set (Least Overfitting Tendency):
Random Forest shows a more balanced performance across training and testing sets, with an accuracy difference of only around 1% (0.6262 on train vs. 0.6385 on test). While Gradient Boosting performs slightly better overall, Random Forest demonstrates lower overfitting, making it a strong contender if we want robust generalization to new data.
3). Good Enough Solution Quickly:
Lasso Regression is the best for quick solutions. With a relatively balanced performance (Accuracy of 60.67% and F1-score of 0.7361 on the test set), it runs faster and is easier to interpret than the more complex models. While it doesn't capture the non-linearities as well as Gradient Boosting or Random Forest, it still provides an adequate solution with good precision and recall, making it a good fallback option when speed and simplicity are prioritized.
References¶
The Good-Better-Best Approach to Pricing. (2018, August 21). Harvard Business Review. https://hbr.org/2018/09/the-good-better-best-approach-to-pricing
Wilber, J., & Werness, B. (2021, January). Bias Variance Tradeoff. MLU-Explain. https://mlu-explain.github.io/bias-variance/
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. https://www.statlearning.com/